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Hidden Markov models (HMM) have enjoyed a rich history of successes over the past decades. 
They have been applied to great effect in almost any conceivable segmentation task, from 
speech recognition and part-of-speech tagging, over financial time series analysis, to seismology 
and beyond. In bio informatics, they are widely used for tasks such as gene finding, isochore 
classification and, most recently, detection of copy-number variation (CNV) in genomic data. 
Advances in biotechnology, such as high-resolution DNA microarrays and next-generation genome 
sequencing, have created data sets of millions and billions of values, presenting new challenges to 
the application of this classic. CNV detection from large genomic data sets is gaining momentum 
in research and diagnostics applications. As it often involves limited computational resources 
and time constraints, the importance of fast, accurate and low-memory approaches to HMM 
inference is obvious. 

As statistical models, HMM depend on a large number of parameters, which have to be either 
provided a priori by the user or inferred from the data. Classic frequentist maximum likelihood 
(ML) techniques like Baum-Welch (Bilmes 1998; Rabiner 1989; Rabiner & Juang 1986) are 
not guaranteed to be globally optimal, i. e. they can converge to the wrong parameter values, 
which can limit the accuracy of the segmentation. Furthermore, the Viterbi algorithm (Viterbi 
1967) only yields a single maximum a posteriori (MAP) segmentation given a parameter estimate 
(Forney 1973). Failure to consider the full set of possible parameters precludes alternative 



interpretations of the data, and makes it very difficult to derive p-values or confidence intervals. 
Furthermore, these frequentist techniques have come under increased scrutiny in the scientific 
community. 

Bayesian inference techniques for HMMs, in particular Forward-Backward Gibbs sampling 
(Chib 1996; Scott 2002), provide an alternative for CNV detection as well (Guha, Li & Neuberg 
2006; Shah, Xuan, et al. 2006; Shah, Lam, et al. 2007). Most importantly, they yield a complete 
probability distribution of copy numbers for each observation. As they are sampling-based, 
they are computationally expensive, which is problematic especially for high-resolution data. 
Though they are guaranteed to converge to the correct values under very mild assumptions, 
they tend to do so slowly, which can lead to over-segmentation and mislabeling if the sampler is 
stopped prematurely. The difficulties encountered in Bayesian HMM inference are the reason 
most research has focused on the ML approach (Ghahramani 2001). 

Another issue in practice is the need to specify hyperparameters for the prior distributions. 
Despite the theoretical advantage of making the inductive bias more explicit, this can be a major 
source of annoyance for the user. It is also hard to justify any choice of hyperparameters when 
insufficient domain knowledge is available. 

Recent work by Mahmud & Schliep (2011) has focused on accelerating Forward-Backward 
Gibbs sampling through the introduction of compressed HMMs and approximate sampling. For the 
first time, Bayesian inference could be performed at running times on par with classic maximum 
likelihood approaches. It was based on a greedy spatial clustering heuristic, which yielded a 
static compression of the data into blocks, and block-wise sampling. Despite its success, several 
important issues remain to be addressed. The blocks are fixed throughout the sampling and 
impose a structure that turns out to be too rigid in the presence of variances differing between 
CN states. The clustering heuristic relies on empirically derived parameters not supported by a 
theoretical analysis, which can lead to suboptimal clustering or overfitting. Also, the method 
cannot easily be generalized for multivariate data. Lastly, the implementation was primarily 
aimed at comparative analysis between the frequentist and Bayesian approach, as opposed to 
overall speed. 

To address these issues and make Bayesian CNV inference feasible even on a laptop, we 
propose the combination of HMMs with another popular signal processing technology: Haar 



wavelets have previously been used in CNV detection (Ben-Yaacov & Eldar 2008), mostly as a 
preprocessing tool for statistical downstream applications (Wang & Wang 2007; Hsu et al. 2005; 
Nguyen et al. 2007, 2010; Huang et al. 2008) or simply as a visual aid in GUI applications (Wang, 
Meza-Zepeda, et al. 2004; Autio et al. 2003). A combination of smoothing and segmentation 
has been suggested as likely to improve results (Lai et al. 2005), and here we show that this is 
indeed the case. Wavelets provide a theoretical foundation for a better, dynamic compression 
scheme for faster convergence and accelerated Bayesian sampling. We improve simultaneously 
upon the typically conflicting goals of accuracy and speed, because the wavelets allow summary 
treatment of “easy” CN calls in segments and focus computational effort on the “difficult” CN calls, 
dynamically and adaptively. This is in contrast to other computationally efficient tools, which 
often simplify the statistical model or use heuristics. The required data structure can be efficiently 
computed, incurs minimal overhead, and has a straightforward generalization for multivariate 
data. We further show how the wavelet transform yields a natural way to set hyperparameters 
automatically, with little input from the user. 

We implemented our method in a highly optimized end-user software, called HaMMLET. Aside 
from achieving an acceleration of several orders of magnitude, it exhibits significantly improved 
convergence behavior, has excellent precision and recall, and provides Bayesian inference within 
seconds even for large data sets. The accuracy and speed of HaMMLET also makes it an excellent 
choice for routine diagnostic use and large-scale re-analysis of legacy data. 

In Chapter 1, we describe structural genomic variations, the main biological and medical 
application that motivates our work, as well as the experimental platforms, the type of data they 
create, the computational challenges and previous approaches to solve them. Chapters 2 and 
3 review the main ingredients of our method, Hidden Markov Models and the Haar wavelet 
transform. Chapter 4 derives our method and investigates its properties, while presenting a 
proof-of-concept implementation as well as experimental evaluation. In Chapter 5, an improved 
implementation is presented to tackle genome-sized data, along with novel algorithms. In Chapter 
6 we apply our method to real-world data. Chapter 7 presents an outlook on promising research 
directions. 
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Chapter 1 

Introduction 


The main motivation behind the development of our method is the inference of genomic copy- 
number variation (CNV) from experimental data. In this chapter, we provide a short overview of 
the biological background, the role of CNV in human health, as well as the experimental and 
computational framework for their study. 

1.1 Structural variation in genomes 

As far as we can tell, all life on Earth, including humans, owes its origin, continued functioning 
and long-term survival to its genetic material, called the genome in its totality. This term refers to 
both its genetic information, as well as the physical medium of its storage. 

On the physical level, the genome consists of four nucleobases, separated into two groups: two 
purine bases (adenine and guanine, A and G) and two pyrimidine bases (cytosine and thymine, 
C and T). These are linked together in a sequential manner to a backbone of phosphate groups 
and the 5-carbon sugar deoxyribose into a single strand of deoxyribonucleic acid, or DNA for short. 
Each unit of phosphate, deoxyribose and nucleobase is called a nucleotide. The propensity of 
each group of nucleobases to form hydrogen bonds between complementary bases (A with T, C 
with G) leads to a phenomenon called hybridization, in which two strands of complementary 
base sequences align together, winding around each other in a double-stranded helix. A set of 
such helices called chromosomes carries the genetic information in organisms. 

On the information level, the genome uses the four-letter code A, C, G, and T (with U for 
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uracil replacing thymine in RNA molecules). Since the sugar and phosphate in each nucleotide 
is identical among nucleotides, the distinction between nucleotides and nucleobases is void 
from an information-theoretic standpoint, and both are called a base in general parlance. As 
DNA is typically double-stranded, a complementary base-pair is used as the basic unit of genetic 
information, similar to bit and byte. It is commonly abbreviated 1 as b, and used with SI prefixes, 
e.g. kilobase (kb) or megabase (Mb). This provides a convenient way to express the size of any 
genomic segment. 

The genetic information stored in a chromosome is very complex. The term genes typically 
refers to so-called protein-coding regions, i. e. sequences that are transcribed into mRNA, which 
serves as a template for protein synthesis, a process called gene expression. mRNA transcripts 
contain so-called untranslated regions on both ends (5’UTR and 3’UTR), which are involved 
in RNA splicing, transcription regulation and translation; though not part of the final protein, 
these sequences are genomically encoded and thus subject to changes in the chromosome as 
well. In more general terms, genes also include sequences that encode for ncRNAs (non-coding 
RNAs), such as transfer-RNA, ribosomal RNA, micro RNA et cetera. Regulatory elements, such 
as promoters, operators, transcription factor binding sites (TFBS), enhancers and silencers, are 
binding sites for proteins which regulate whether and to what extend a gene is expressed at any 
given time. Elucidation of information content is still an ongoing endeavor, but the plethora of 
functions described here may serve as a glimpse into the intricacy of information encoded in this 
simple strand-like molecule. 

The human genome is subject to a variety of physical processes which disrupt its architecture, 
commonly referred to as structural variants (SV). Chromosomes can be merged or split, have 
their parts rearranged, duplicated, deleted, inverted or shuffled (Fig. 1.1); a large variety of 
types and mechanisms has been studied, see Hastings et al. (2009). Since genetic information 
is encoded in the DNA sequence, any structural variation changes its content. Among the many 
types of SV, copy-number variants (CNV) and single-nucleotide polymorphisms (SNP) account for 
the majority of observed cases, and are relevant for the remainder of this thesis; other SV such 
as inversions and translocations are neither of interest in this study nor can they be inferred by 

1 Conventions vary, as b is sometimes used instead of nt (for nucleotide ) to denotes a single base, whereas bp 
denotes a basepair. 
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Figure 1.1: Overview of structural variations (SV) relative to a reference genome. Copy-number 
variants (CNV) detectable by the method presented in this thesis are indicated by gray background. 
Figure modified from Alkan, Coe & Eichler (2011), with permission from Macmillan Publishers 
Limited. 


the algorithm presented here in its current form. 

Copy-number variants (CNV) Copy number variants are defined as segments of DNA for which 
the number of occurrences within one individual’s genome differs from that of corresponding 
sequences in the reference genome. In genomics, the term copy-number variants is commonly 
used to refer to variants found in multiple individuals of a population, and is distinguished from 
copy-number aberrations, which are typically associated with germline or somatic changes that 
have adverse effects for the health of one specific individual (Shlien & Malkin 2009; Valsesia 
et al. 2013; Navin 2015). Since CNV and CNA are equivalent from a data analysis standpoint, 
we refer to both of them as CNV in this thesis. Aside from large-scale changes due to full or 
partial aneuploidy, i. e. abnormal counts of entire or partial chromosomes, CNV are comprised of 
insertions and deletions (indels , for short), as well as tandem repeats. In the more narrow sense, 
CNV refers to intermediate-size copy-number aberrations between 1 kb and 5 Mb (Freeman 
2006), which is small compared to the size of the data and thus challenging to find. Likewise, 
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indel often refers to much smaller events; this terminology historically stems from different 
experimental techniques and their resolution capacities. Unless otherwise stated, we shall refer 
to CNV as any kind of insertion or deletion of genomic material. 

Single-nucleotide polymorphisms (SNP) Single-nucleotide polymorphisms refer to exchanges, 
losses or gains of single base pairs. They are important in the study of population genetics and 
human ancestry. While not directly a target of our computational method, they play an important 
role in the measurement of allelic CNV in the form of SNP arrays (Section 1.3.1). 

Structural variations affect the information encoded on the DNA strands in a variety of 
ways; for a review, see Hurles, Dermitzakis & Tyler-Smith (2008). Insertions and deletions 
can disrupt a protein-coding sequence by introducing a premature stop codon, resulting in a 
truncated protein product. It can also create chimeric proteins, i. e. a molecule composed of 
two partial proteins. If the inserted sequence is not from a coding region, the resulting product 
would partially consist of a random amino acid sequence. Most of these changes will result in a 
loss of function for the affected gene, but especially chimeric proteins can result in coupling of 
regulatory pathways. Structural variants can also result in a translocation of regulatory elements 
such as promoters, which puts the affected gene under regulatory control of a pathway it would 
not usually be part of, thus disrupting the functioning of the cell on a systems level. CNV can 
also cause dosage effects, i. e. abnormal expression levels of gene products due to an abnormal 
number of genes. If SV occur in regions which do not code for genes, regulatory elements or other 
information such as ncRNA, and are not important for functions such as chromatin regulation, 
they may also have no effect at all. 

The resulting changes to the genome can have phenotypic consequences, ranging from 
adaptation to environmental factors to severe illness. Studying SV promises to yield important 
insights into genome function and causal relationships between the genetic information level on 
one side and disease and phenotype on the other. The experimental platforms used in both clinical 
diagnostics as well as basic and translational research generate large amounts of high-resolution 
data, posing a challenge from the bioinformatics standpoint. 
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1.2 Implications of SV for human phenotype and health 

Structural variations in genomes arise in two different forms. Germinal mutations occur in the 
germline, and usually affect all, or, in the case of mosaicism (Sturtevant 1929), some cells 
in an individual’s body. Somatic mutations on the other hand arise within individual cells of 
an organism during its lifetime. These can be neutral to cell function, or, in most cases, incur 
damage that renders the cell inviable. In rare cases however, the changes can give the cell a 
competitive advantage by making it both self-sufficient and independent of internal and external 
regulatory and apoptosis-inducing signals, leading to uncontrolled cell proliferation known as 
cancer. 

Structural variations have important biomedical implications, and are the focus of many 
genetic studies to elucidate the etiology of complex diseases, in particular cancer and neuropsy¬ 
chiatric disorders. 

1.2.1 Somatic SV and cancer 

It has been over a hundred years since an association between chromosome abnormalities and 
cancer has been established by Boveri (1914) as one of the most prominent features of cancer 
cells, encountered in all major types of tumors (Frohling & Dohner 2008). This has lead to 
the notion of cancer being a disease of the genome. With cancer causing 1 in 8 deaths globally 
(Garcia et al. 2007), it is no exaggeration to say that elucidation of their role can be considered 
one of the most important research objectives of our time. 

In cancer, somatic changes in copy number are one of the most noticeable changes in the 
affected cells (Nakagawa et al. 2015). Commonly referred to as copy-number aberrations 
(CNA), they are the target of diagnostic platforms such as array-based CGH (Section 1.3). While 
experimental evidence suggests a causative role for aneuploidy in tumorigenesis (Foijer, Draviam 
& Sorger 2008; Holland & Cleveland 2009), discerning causal CNV (driver mutations) from 
those that are mere byproducts of cancer progression (passenger mutations) is still an open 
research problem, which requires highly accurate calls of CNV from experimental platforms. For 
further details, Garraway & Lander (2013) has an excellent review of cancer genomics. 
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1.2.2 Germinal SV and human diversity 

With the plethora of evidence for somatic variations, their germinal counterparts have only 
recently been determined to occur in abundance in the human population, constituting a major 
contributor to human genetic variation. While SNPs were originally thought to account for the 
majority of phenotypic variation (The International SNP Map Working Group et al. 2001; 
The International HapMap Consortium 2005), a number of recent studies have shown that 
CNV show remarkable abundance within the human population, even among healthy individuals 
(Sebat, Lakshmi, Troge, et al. 2004; Iafrate et al. 2004; Tuzun et al. 2005; Sharp et al. 
2005; Fredman et al. 2004; Redon et al. 2006; de Vries et al. 2005; Schoumans et al. 2005; 
Sharp et al. 2005; Tyson et al. 2005; Conrad et al. 2006). Freeman (2006) contains an 
extensive history of CNV discovery. This discovery has huge implications outside the more obvious 
population genetics applications; the prevalence of CNV raises the issue of false negative calls in 
biomedical applications. Specifically, an accurate map of copy-number polymorphisms, stratified 
by donor ethnicity, is required as a true negative set when investigating disease associations of 
CNV. 

1.2.3 Neuropsychiatric disorders 

CNVs have been implicated in a variety of neuropsychiatric disorders (Malhotra & Sebat 2012; 
Cook Jr & Scherer 2008), in particular in autism (Chung, Tao & Tso 2014). 

Autism spectrum disorder (ASD) refers to a range of neurodevelopmental disorder traditionally 
characterized by the triad of impaired reciprocal social behavior, poor communication skills, 
and stereotyped or repetitive patterns of behavior, interests or activities Kanner (1943). It 
includes previously distinguished diagnoses of Asperger syndrome, PDD-NOS, and childhood 
disintegrative disorder. 

While ASD diagnosis is based entirely on behavioral observations due to the complex etiology 
of the phenomenon, it has long been recognized as having a strong genetic component (Folstein 
& Rosen-Sheidley 2001), but no single cause has yet been established. Early studies already 
showed microscopically visible chromosome anomalies in 7-8% of patients (Xu et al. 2004). 
With the advent of microarray technology (Section 1.3.1), a wide range of submicroscopic CNV 
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has been found to be associated with the ASD phenotype, and there is considerable support for 
a causative role of CNV. De novo occurrence of CNV has been reported as 3-5 times higher in 
affected families vs. control, and multiple de novo CNV have been found to lead to more severe 
symptoms. Furthermore, recurring CNV have been found in unrelated ASD patient. Lastly, CNV 
are enriched for genes associated with neuronal synaptic disorders, and overlap with those found 
in ADHD, SZ and ID; for an extensive review of those studies see Sener (2014). The role of 
CNVs in ASD and schizophrenia is reviewed in Merikangas, Corvin & Gallagher (2009). 

However, there has been a paradigm shift in recent years towards the notion that there is no 
single causative mutation underlying the disease. Instead, the autism spectrum is considered to 
be a shared phenotype of multiple rare but highly penetrant genetic risk factors. To elucidate the 
role of CNV in this context, accurate CNV calls on high-resolution genomic data are critical. 

Tourette syndrome (TS) is another disorder, characterized by motor tics, such as blinking, 
grimacing or head jerking, and phonic tics, such as coprolalia, palilalia and echolalia. It is often 
associated with obsessive-compulsive disorder (OCD) (Pauls et al. 1986) and attention-deficit 
hyperactivity disorder (ADHD). Though there is evidence for a genetic basis of the disease, it is 
not fully understood (O’Rourke et al. 2009) . Recently, a role of rare large-scale CNV in TS and 
its aetiological overlap has been suggested (Nag et al. 2013), thereby adding to the notion of a 
general association of CNV and neuropsychiatric disorders. 

1.3 Experimental platforms 

A wide range of experimental platforms of increasing resolution has been developed over the 
last years (Alkan, Coe & Eichler 2011), and analysis of the data they produce is the central 
motivation and application area for our method in bioinformatics. 

1.3.1 DNA microarrays 

A DNA microarray, sometimes called biochip or gene chip, consists of a carrier substrate such 
as glass, plastic or silicone. On its surface, probes of single-strand DNA are affixed in a regular 
grid, such that each probe consists of DNA material of a different, predefined DNA sequence. 
Each probe is capable of capturing single-stranded DNA material of a reverse-complementary 
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sequence through hybridization, thus forming a short double-stranded helix. In order to assess 
the presence and abundance of DNA material in a sample, the sample DNA is fragmented into 
short pieces, fluorescently labeled and hybridized to the array. Then, the light emitted from each 
fluorescent spot is measured as a proxy of the abundance of DNA bound to each probe. 

Probes are designed such that they represent known, preferably unique positions in the 
genome, so that the abundance of chromosomal segments can be measured. Microarrays yield 
very noisy data, due to the complexity of the biochemical process; furthermore, Gaussian noise 
is inevitable due to the quantum nature of photon emissions and its interaction with the light 
receptors in the equipment. 

Obviously, the resolution of this method is limited by how many positions in the reference 
genome are covered by the set of probes. More importantly, microarrays rely on a predefined 
set of probes, i. e. they cannot detect sequences for which no probe is used. If, for instance, the 
genome contained a novel sequence due to a retrovirus, or the sample was contaminated by 
another DNA source, this would go undetected. 

While there are many types of microarrays for different purposes, aCGH and SNP arrays are 
the most prominent types of platform for CNV detection. 

Array-based Comparative Genomic Hybridization (aCGH) As the name suggests, aCGH is 
used to compare relative DNA abundance between two genomes, such as cancer tissue versus 
control. The two cases are labeled in different colors such as red and green, and then hybridized 
to the same microarray. The photo detector is then used to measure the resulting wavelength as 
a proxy for relative DNA abundance; for instance, if both genomes have the same amount at a 
probe, the resulting color will be yellow, whereas a shift towards red or green would indicate 
higher abundance in the respective genome. aCGH thus measures the relative abundance, though 
typically the control is assumed to be diploid, so that absolute abundance in the target genome 
can be derived. Relative abundance is typically expressed as the log2 ratio of green and red 
signals. The process is illustrated in Fig. 1.2. 

It is important to note that since the chromosomes are fragmented, the only feasible way to 
position the measurements is via the reference genome used to design the probes. This means 
that the information obtained from aCGH is abundance of a chromosomal fragment, but not 
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Figure 1.2: An overview of array-based comparative genome hybridization (aCGH). Sample and 
control DNA are flourescently labeled (1), and hybridized on an array of spotted DNA probes of known 
sequence, yielding a differential color signal (2). Measuring those colors and mapping the signal to 
the genome positions corresponding to each probe (3) yields a piecewise-constant signal with noise 
(4), which can be used to detect CNV. Figure reproduced with kind permission from Cancer Genetics, 
Inc. (www.cancergenetics.com). 


its location. In the most extreme, hypothetical case that the entire genome had been severely 
shuffled and merged into a single chromosome, aCGH would not detect this. 

Single-nucleotide polymorphism arrays (SNP arrays) The normal human genome contains 
two copies of each autosome. However, those two copies are usually not completely identical, but 
carry different alleles of the same locus, a phenomenon called heterozygocity. Most importantly, 
they often carry SNPs at many positions. However, some chromosomal segments may lack such 
differences; they are homozygous. Especially in somatic mutations, loss of heterozygocity (LOH), 
i. e. the loss of one allele coupled with a duplication of the other, is known to be associated 
with or even a driver of a wide range of diseases. Since loci which are heterozygous only in 
a SNP differ by only one nucleotide, and hence can bind both to the same probe. Therefore, 
LOH cannot be detected by aCGH. To alleviate this restriction, SNP arrays contain carefully 
designed sets of probes for SNPs known to exist in the human population, which allows to resolve 
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allelic copy number. The data from SNP arrays is similar to that from aCGH, with the important 
distinction that it is 2-dimensional, with each channel corresponding to one of two alleles. Since 
probes for each SNP differ only by one nucleotide, the resulting cross-hybridization yields a lower 
signal-to-noise ration than aCGH, thus posing additional challenges for CNV inference. 

1.3.2 Next-generation sequencing 

DNA sequencing refers to determining the sequence of bases A, C, G and T in a DNA sample. Aside 
from more labor-intensive methods such as Sanger sequencing and Maxam-Gilbert sequencing, 
next-generation sequencing (NGS) techniques have enabled high-throughput sequencing of DNA 
samples, creating several GB of data per sample; for a review, see Metzker (2010). 

For NGS, DNA material is amplified and fragmented into short segments of several hundred 
bases. These fragments are then sequenced into short strings known as reads, ranging in size 
from 75-250 bp, depending on the platform. As the information about the original position of 
each read is lost during fragmentation, it has to be reconstructed computationally. In paired-end 
sequencing, the fragment is sequenced from both sides, yielding additional information about 
the relative distance of two reads. As this distance is limited by the fragment size, mate-pair 
sequencing uses an additional experimental step in which the two ends from a fragment originate 
from loci further apart in the genome than on the fragment. 

When analyzing structural variation, it is important to note that reads essentially provide two 
layers of information. The pure sequence information can be used to determine sequence-level 
changes such as inversions and rearrangements, and can largely be subsumed under an elaborate 
approximate string matching paradigm, where the major source of noise are low-frequency 
sequencing errors. SV calls can often be made on a per-read basis, and are thus algorithmically 
more “localized”. 

On the other hand, abundance of similar genomic sequences carries information about the 
number of copies of a genomic locus, which can be used for CNV detection. Two basic approaches 
exist: Read-depth (RD), sometimes referred to as depth of coverage (DOC) is determined by using 
a read mapper, essentially a highly optimized implementation of approximate string matching, 
to map reads to a known reference genome. After several post-processing steps to account for 
error sources such as amplification, GC and dinucleotide bias, the coverage at each nucleotide 
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Figure 1.3: Illustration of aCGH (left) and read-depth data (right) for CNV detection. Duplication in 
the sample genome increases shifts the intensity ratio and the number of reads mapped to the affected 
loci in the reference genome, and hence induces an upwards shift in the count signal (blue). Likewise, 
a loss incurs a downwards shift in the signal (red). 


in the reference is used as a proxy for DNA abundance at each position in the sample; for a 
review, see Magi et al. (2012). Read-depth data is illustrated in Fig. 1.3. De novo assembly 
(AS) on the other hand tries to merge the reads into contiguous strings called contigs, in which 
paired-end/mate-pair information as well as the multiplicity of occurrence of reads again used as 
a proxy for DNA abundance, but does not necessary require a reference genome. Both approaches 
require a more global look at the data, in that they should model the general noise level, for 
instance. 

Whole-genome sequencing (WGS) The most straightforward application of NGS is the se¬ 
quencing of the entire genomic material in a sample, known as Whole-genome sequencing (WGS). 
For reviews of this technique, see Pirooznia, Goes & Zandi (2015), Pabinger et al. (2014), 
and Hehir-Kwa, Pfundt & Veltman (2015). As it is not a targeted tool and yields information 
about every genomic locus regardless of the information it carries, it is the method of choice for 
fundamental and exploratory research, in particular for cancer (Nakagawa et al. 2015). 

Whole-exome sequencing (WES) Due to the relatively high cost of WGS, targeted sequencing 
methods are often used, in which a predefined subset of DNA fragments is selected using 
target-enrichment techniques such as PCR, molecular inversion probes, in-solution capture, or 
hybridization to a microarray (hybrid capture). This reduces the amount of DNA to be sequenced 
significantly, especially in cases where only the protein-coding regions of the genome, the so- 
called exome is targeted. This whole-exome sequencing (WES), see Kadalayil et al. (2015), 
Tetreault et al. (2015), and Hehir-Kwa, Pfundt & Veltman (2015), leads to a 100-fold 
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reduction in sequencing for the human genome. While the reduced cost makes WES attractive 
for CNV detection, target enrichment introduces a considerable bias in read depth, which is not 
easily corrected computationally (Kadalayil et al. 2015). Though various approaches exist, we 
argue that the expected drop in sequencing cost makes it more worthwhile to focus method 
development on larger, but statistically more well-behaved data such as WGS. 

1.4 Algorithmic challenges and prior approaches 

Microarray-based methods will continue to play an important role in diagnostic settings due to 
their cost-effectiveness. The expected increase in resolution and low concordance among analytical 
tools (Pinto et al. 2011) necessitate the development of efficient bioinformatics tools for unbiased 
CNV calls on these platforms. However, the complexity of cancer and neuropsychiatric genomes 
necessitates even higher resolution that can only reasonably be achieved through sequencing 
approaches. In clinical settings, exome sequencing currently provides a compromise between 
cost and resolution, but the complex etiology of both classes of disease would require the use of 
WGS. It has been suggested that most driver genes, i. e. genes that are causal to tumorigenesis, 
have been identified (Nakagawa et al. 2015; Vogelstein et al. 2013; Lawrence et al. 2014), 
while rarer gene mutations and those in non-exomic regions such as promoters, ncRNAs and 
introns remain to be investigated (Leiserson et al. 2014; Garraway & Lander 2013). For 
instance, Freedman et al. (2011) suggests a role for regulatory elements in carcinogenesis. As a 
consequence, whole-genome approaches are likely to become more important in fundamental 
research (Garraway & Lander 2013), as the plethora of CNV calling methods for WGS data, 
reviewed in Pirooznia, Goes & Zandi (2015), Abel & Duncavage (2013), Zhao et al. (2013), 
Pabinger et al. (2014), and Duan et al. (2013), shows. While sequencing cost can be expected 
to drop, the computational power required to perform meaningful bioinformatics analysis is still 
beyond the capacities of most clinical institutions (Chung, Tao & Tso 2014), thus posing a huge 
challenge for the computer scientist. 

While different platforms generate data of different characteristics, which in turn necessitate 
statistical and computational approaches tailored specifically to them (Wineinger et al. 2008), 
any experimental technology has its half-life, and generalized models that can treat data specifics 
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under a common framework are more likely to advance the field in the long run. As data 
created by microarrays and NGS for CNV detection can adequately be modeled as piecewise 
constant with additive noise (see Fig. 1.3), time-proven methods such as Hidden Markov Models 
(Baum & Petrie 1966) have repeatedly been applied in the context of CNV detection (Snijders, 
Fridlyand, et al. 2003; Sebat, Lakshmi, Troge, et al. 2004; Sebat, Lakshmi, Malhotra, et al. 
2007; Fridlyand et al. 2004; Zhao 2004; de Vries et al. 2005; Nannya et al. 2005; Marioni, 
Thorne & Tavare 2006; Korbel et al. 2007; Cahan et al. 2008; Rueda & Diaz-Uriarte 2009). 
They are favorable from a modeling standpoint, as they directly express the separate layers of 
observed measurements, such as log-ratios in array comparative genomic hybridization (aCGH), 
and their corresponding latent copy number (CN) states, as well as the underlying linear structure 
of segments. At the same time, advances in experimental technology create ever larger data sets, 
implying the need to leverage statistical analysis to handle big data. In this thesis, we shall hence 
take a high-level approach to CNV detection, and focus on improving and accelerating the type 
of Hidden Markov models suitable for this kind of data. 
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Chapter 2 

Hidden Markov Models 


In this chapter, we review the basics of HMM inference. We use the following notation: boldface 
(y) denotes some ordered data type which allows indexing, such as a vector or an array. y[i] 
denotes the i-th element of y; all indices are zero-based, y[i][j] denotes the element in the 
i-th row and j-th column in a two-dimensional data type. Slices/ranges are denoted as y[i: j] 
with the j-th entry included (closed interval). In contrast, y, denotes the i-th element out of an 
(unordered) collection {y 0 ,y!,...}. 

2.1 Probability theory 

Let the probability space (Q, S, P) be a measure space, where Q is called the sample space, S is 
the set of events, and P is called probability measure. Further, let P (fi) = 1. A random variable is 
a function 

X: (Q, S, P) -> (K, B{ K), P x ) 

where K typically is M, Z, C,_ X is E-6(K)-measurable, i. e. V# € K): X -1 (£) € S, which 

implies 

P x (B) = P (X _1 (B)) = P ({to e n |X(co) e B}). 

Further, let V x (v) be a function 

P X :R-»[0,1], V x {x)p{ dx) = l, 

Jk 
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called the density of X. The integral is the Lebesgue integral with an appropriate measure /i on 
K. This function can be used to define the probability measure on (K, B(K), F x ) as the integral 

IPx(^) : = 'PxM^dx). 

Ja 

Since we will be dealing almost exclusively with simple densities and probability mass 
functions over subsets of M n and Z n , we follow the notational convention common in machine 
learning to blur the distinction between events (X = x), random variables (X), and their 
realizations/variates (x). We denote both the conditional density V{X\Y = y) as well as the 
likelihood function V{X = x | y) as V[X \ Y), and it is derived from context which of X, Y remain 
fixed, whenever such information is necessary. Occasionally, the likelihood is denoted as C{Y |X). 
Consequently, there is no logical distinction between lower- and upper-case notation. Furthermore, 
the distinction between parameters and random variables is meaningless in a Bayesian context 
due to the inversion principle, see for example Robert (2007). We also follow the convention to 
denote the fact that a random variable X is distributed according to a distribution P depending 
on a parameter 6 as X ~ V(X \ 6). We also simplify the integral notation to 


7\(x)M(dx):=l P(X)dX. 

In general, since the domain is often clear from context, we will drop it from the integral whenever 
possible, especially when integrating out dimensions, e.g. 

V{X,Y)dY = V(X). 

J 

Let 

E[X] := XdP = X(co)P(da>) = XP(X)dX 

J £1 J fi J 

be the expected value of random variable X, or the expectation for short. For continuous and 
discrete univariate random variables, this becomes 


E [X] = XP(X)A(dX)= XP(X)dX, and 


—oo 
oo 


r oo 

E [X] = XP(X)#(dX)= ^ 

JZ X=— oo 
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with Lebesgue measure X and counting measure #, respectively. More generally, for any function 
/, we define the expectation 


E x mX,Y)] 


f{X,Y)V{X,Y) dX, 


as well as the conditional expectation 


E x [f(X,Y)\Z] 


f(X,Y)V(X,Y\Z)6X. 


By the law of total expectation, 


E[X]=E y [E x [X\Y]]. 

Random variables X, Y are said to be independent, denoted X1.Y, iff 


V{X,Y) = V{X)V{Y), 


and conditionally independent, denoted XLY\Z iff 


V{X,Y\Z) = V{X\Z)V{Y\Z). 


Linearity of expectation holds for any set of random variables X h i. e. 





Similarly, the variance is defined as 


Y[X] :=E[(X-E[X]) 2 ] = E[X 2 ]-E[X] 2 . 


By Bienayme’s formula, for a sum of uncorrelated (and hence independent) random variables, 


Y 



i 


A Bayes net (BN) is a directed acyclic graph which contains an edge from B to A if A is conditioned 
on B, see Fig. 2.2 for an example. For a random variable X, let 


pa(X) :={A\V(X\A)^V(X)}, 

ch(X) :={A\V{A\X)^V (A)}, and 

co(X) := {A|C echX,iP(C|A) ^P(C),A#X}. 
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The quantities papf), chpf) and co(X) are called the parents, children, and coparents of X 
respectively, referring to the relationships of notes in a Bayes net. Let 

d X := pa(X) U co(X) U chpf) 
be the Markov blanket of X. Then 

X 1 8X\ ({I}U5I) C . 

The Markov blanket plays an important role in Gibbs sampling, since sampling of a random 
variable only requires conditioning on the Markov blanket as opposed to the entire set of variables 
in the model. An undirected graph is called a Markov Random Field (MRF) over a set of random 
variables if, for every variables, the set of its adjacent nodes equals the Markov blanket of said 
node. 


2.2 Bayesian inference and the exponential family 

Bayes’ theorem can be easily derived in the following equivalent transformations 


V{M,D) = V{D,M), 


V{M\D)V{D) = V{D\M)V{M), and 
V{D | M)P(M) 


P(M|D): 


V{D) 


If D is constant, then V (D) simply serves as a normalization factor for the left side to integrate 
to 1. Also, V{D |M) then becomes a function of M, not D, and does not represent a density 
since it does not necessarily integrate to 1. To emphasize this fact, V(D \ M) is often denoted as 
£ (M | D), yielding Bayes’ theorem in the form of 


V{M | D) oc M £ (M \D)V (M). 


Here, V(M) is called the prior distribution, £ (M | D ) is called the likelihood function, and P(M | D ) 
is called the posterior distribution; this captures the essence of Bayesian inference. If D represents 
some observed data, and M encodes our model of the data, the prior captures our model before 
the observation of the data, either from past observation or due to some assumptions (inductive 



18 


bias). The posterior represents the model after observing the data, and is proportional to the 
product of the prior and the probability of observing the data under the prior model. This process 
can be iterated, i. e. the posterior can become the prior for some new observations, hence the 
model gets more refined the more data is incorporated. In other words, the probability of the 
observed data under the current model becomes the likelihood of the model in light of the 
observed data. 

Assume the model consists of a probability distribution fully described by some parameter 9, 
and the data consists of some variates x t ~ V{x \ 6), so we obtain 

P(0|x)oc£(0|x)P(0). 

Further, assume the prior is fully defined by some hyperparameter i. For the joint distribution of 
data, parameter and hyperparameter, we thus have 

V(G,x,t) = V(x,9,'z) 

4=> P(0 \x,'i)V{x,'i) = V{x | 0,t)P(0,t) 

= P(x|0,t)P(0|t)P(t). 

Assuming the hyperparameter does not influence the data directly, we have x _L z \ 6 and hence 
V{x | 6, t) = V{x | 0), so we obtain 

V{6\x,z) oc q £(0 |x)P(0 |t)P(t). 

The hyperparameter itself is kept constant, since it encodes our prior belief of the parameter 
model, hence 

W I*,t) oc e £(0 \x)V{0 |t). 

Let an exponential family distribution (EFD) be any distribution for which the PDF is of the 
form 

V(x | 0) = exp ((0, T (x)) + h (be) — A(0)), 

where 0 is a parameter vector 1 , A is a scaling factor called the log-partition function, h is called 
the carrier measure, and T are the sufficient statistics. A vector T(X ) is called a sufficient statistic 

1 The usual definition uses the more general form tj (0) for the parameters, but here we always assume the 
canonical form 7)(9) = 6 for simplicity, since any EFD can be transformed accordingly. 
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of a random sample X, if it contains all information about a distribution parameter that can be 
obtained from the complete sample. Formally, 

0 ix | t(x). 


or, equivalently, 

P(0 |T(X),X) = P(0 |T(X)). 

By the Neyman-Fisher factorization theorem (Fisher 1922; Neyman 1936; Halmos & Savage 
1949), a distribution has sufficient statistics if and only if its density function can be factored as 


V{x\ 0) = H{x)g{T{x,0), 


i. e. it depends on x only via T. Obviously, EFD permit this factorization; furthermore, among all 
distributions for which the domain does not vary with the parameter, only EFD have sufficient 
statistics whose size remains bounded with increasing sample size (Darmois 1935; Koopman 
1936; Pitman, Wishart & Fisher 1936). 

For a sample of i.i.d. variates {x 1 }^ =1 , the likelihood function of an EFD becomes 

N N 

C{e | (x f }? =1 ) = Y\w l*,0 = Y\exp((T(x i ie) + Kx i )-Am ( 2 . 1 ) 

i=l i=l 


ex P ( r (*;)> 9 ) + M>;)j ~NA(0) j (2.2) 

ex P T ( x i)> 9 ^ + ^ h (Xi) -NA(0) j (2.3) 


by linearity of the inner product in its first argument. It follows that the likelihood of a sample 
can be computed in a summary fashion, using only a fixed number of values in the sufficient 
statistics, independent of sample size N. For each EFD, there exists a conjugate prior distribution 
(Diaconis & Ylvisaker 1979) of the form 


V{6 \ T,n) oc exp((T, 6) — nA(9)). 

Dropping normalization constants for simplicity, the posterior can then be derived as 

V{6 | ,t) oc £(0 | {xjf =1 )p(0 | t) 

-cxpffe T(Xi), G \ -NA(0)1 exp((r, 0) - nA(0)) 
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Figure 2.1: Visualization of a simple 3-state HMM for CNV detection. The hidden state path q is a 
Markov chain over the states {+, —, =}, representing a gain, loss, and diploid state, respectively. Upon 
visiting a state, a sample from its emission distribution (represented as gray densities next to each 
state) is drawn, according to the emission parameters 0 = {0 + ,0_, 9 = }. In this case, emissions come 
from Gaussian distributions with different means and variances. 


exTp^yj(x i ),9^l + (t, 0)-lVA(fl)-nA(fl)j 
= exp ^ t + T(Xi), 9 ^ - (n + N)A(0 ) j . 


Hence for posterior updates, it holds that 


V\ 9 


N \ 

T + ^T{Xi),n + N oc£(0| | T, n). 


V ! = 1 J 

Conjugacy thus implies that the posterior is of the same analytical form as the prior. Its count 
parameter n is updated by simply adding the sample size, and its hyperparameter t is updated by 
adding the sufficient statistics of the observed sample. Since more observations can be included 
iteratively without changing the distribution type of the posterior, we say that a posterior is strong 
if its count parameter n is large. 


2.3 The model itself 

Let T be the length of the observation sequence. An HMM can be represented as a statistical 
model (q,A,9,n\y), with transition matrix A, a latent state sequence 


q =(q[0],q[l],...,q[r-l]). 
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Figure 2.2: Left: Bayes net representation of an HMM. Arrows indicate conditional relations, e.g. 
V (q [0] | n, ,4). Center: Bayes net representation of the HMM by treating all emissions y [t] and all 
latent state variables q[t] as joint probability vectors y and q. Right: Markov Random Field (MRF) 
representation of the blocked HMM, obtained by adding the moralizing edges (0,q) and (re, A). The 
Markov blanket of a node corresponds to its direct neighbors in the MRF. Shading indicates that the 
variables are observed. 


an observed emission sequence y = (y[0],y[l], ...,y[T— 1 ]), emission parameters 9 , and an 
initial state distribution n. The vector 0 = (0,,..., 9 p ) parametrizes the emission distributions 
depending on the underlying state, i. e. 

V(y[t]\0,q) = v{y[t]\9 q[t] ). 

The state sequence is modeled as a first-order Markov process with 

V(q[t] = j \ q[t-l] = i) =: A tj 

To derive conditional independence relations, the following Markov blankets can be easily read 
off (Fig. 2.2): 


dy[t] = {q[tl6}, 

(2.4) 

dq[0] = {n,y[0],q[l],A,G}, 

(2.5) 

dq[t> 0] = {q[t-l],q[t + l],y[t],A,6}, 

(2.6) 

d n = {q[0],M}, 

(2.7) 

d A = {q, n), and 

(2.8) 

d0 = {y,q}. 

(2.9) 


Collapsing emissions and state sequence into one multivariate random variable each (Fig. 2.2) 
by merging nodes while maintaining adjacencies yields 


dy = {q,0} 
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d n = {q,A }, and 
dq = {y,0,n,A}. 

Note that this operation, while yielding a more convenient notation, induces unnecessary condi¬ 
tional relations, for instance within q, which are understood to be ignored whenever appropriate. 

Since we assume the hyperparameters to be constant, we drop dependence on them whenever 
appropriate for notational clarity, i. e. we define V{x) := V{x \ x x ). The total distribution of an 
HMM then factorizes as 

V{y,q,0,n,A) = V{y\0,q)V{q\n,A)V(0)V{A)V{n) 

T -1 

= V(d)HA)V(n)nq[0]\7i,A)nymq[0ie)Y\r(y[t]\q[t],e)Hq[t]\q[t-l],A). 

t =1 

From the MRF representation, it is easy to see 

V[y\d,q,A, n) = r{y\q0), 
so the likelihood function for an HMM with observed data y is 

£(0|y)= f V(y\q,6)dq. 


2.4 Segmentation using HMM 

Solving the segmentation task using HMM is based upon q . There are two main approaches. 
For a fixed set of parameters, the most likely state sequence can be determined using the Viterbi 
decoding algorithm; such a state sequence is known as the Viterbi path. Alternatively, segmentation 
by maximum posterior marginals (MPM) chooses the most likely state at each position t. Notice 
that these two are not necessarily the same; indeed, a sequence of maximum state margins 
can have very low likelihood, or even be an impossible state sequence if it contains transitions 
for which A contains zero-entries. It does, however, have the advantage of integrating over all 
possible state paths. 

The approaches central to our method are described in the following subsections. For 
notational clarity, all conditioning on HMM parameters is dropped wherever possible. 
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2.4.1 Filtering 


Filtering describes the probability of being in a certain state at position t, given all observations 
up to this point, i. e. 

V(q[t]\y[0:t]). 


Naively, for k states, that would require the computation of k t+1 state sequences, an unreasonably 
effort. However, a simple recursion based on the fact that state sequences can share a common 
prefix is used to derive a dynamic programming approach called forward algorithm. By Bayes’ 
formula, 


i p (q[t]|y[t],y[0:t-i]) = 


P(y[t]|y[0:t-i]) 


V(q[t]\y[0:t- 1]). 


Since 


y[t]ly[0:t—1] \q[t ] 


by the Markov property, we obtain 

a t [j] :=P(q[t] = j|y[0:t]) oc ; - T(y[t] \q[t] = j)T(q[t] = j |y[0:t-l]). 


This term is called the forward variable 2 . The distribution of the last term, 


V(q[t]\y[0:t- 1]), 


sometimes called the one-step-ahead predictive density (Murphy 2012), can easily be expressed 
as 


V(q[t] = j\y[0:t-l]) = ^ i V(q[t]=j\q[t-l] = i)V(q[t-l] = i\y[0:t-l]) 

i 

= Y^ Ai i a t- itt- 

i 

The forward variables can thus be recursively defined as 

a t [j] oc p(y[t] | 


2 Note that in other derivations, the forward variable refers to the joint density P{q[t],y[0:t]) instead. In those 
settings, normalization to the conditional version used here is treated as a means to obtain numerical stability instead. 
The two versions are equivalent from a theoretical standpoint. 
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All forward variables for an HMM with k states can hence be recursively computed using a T x k 
dynamic programming matrix T called a trellis, using 

T[j][t ] oc p(y[t] | O^AijUat - 1] 

i 

with subsequent normalization of columns T[ t ] to sum to 1 in order to obtain the conditional 
state distribution. The forward recursion can be expressed compactly as 

a t oc a t _i) ® £ t , 

where £ t is the vector of emission likelihoods at position t and © is the Hadamard product. An 
implementation of this recursion has time complexity 0{k 2 T). 

2.4.2 Smoothing 

In order to obtain maximum state marginals, the distributions 

Tt -=V(q[t]\y) 

are derived from the forward variables 


V{q[t]\y[0:t]). 

Using the fact that 

V{A\B,C) = ^^- °c a V{A,B\C), 
the marginal state probability decomposes as 

V(q[t]\y)oc q[t] r(y[t+1:T-Ilq[t]\y[0:t]) (2.10) 

= V(q[t]\y[0:t])V(y[t+l:T-l]\q[t],y[0:t]-) (2.11) 

= V(q[t]\y[0:t])V(y[t+l:T-l]\q[t]), (2.12) 

since y[0:t] _L y[t+l:T—1] | q[t]). Let the backward variable 

lit[i]-=V{y[t+l:T-m\q[t] = i), 

then 


r t \i] oc a t [i]0 t [i]. 
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The backward variable can be recursively defined as 

li t -i=V(y[t+l:T-l]\q[t]) (2.13) 

= Y i V(q[t]=j,y[t],y[t+l:T-l]\q[t-l]) (2.14) 

j 

= 2 7 :, ^[t+ 1 : T’- 1 ]lq[t] = ; ) y[tLq[t-i])^(q[t] = j ) y[t]lq[t-i]). (2.15) 

j 

Since y[t+l:T-l] 1 (q[t-l],y[t]) | q[t], 

lit- 1 = J]v(y[t+l:T-l]\q[t] = i)V{q[t ] = i,y[t] |q[t-l]) (2.16) 

j 

= '^ l ‘P(yU+l:T-l]\q[t]=j)P(y[t]\q[t]=j,q[t-l])P{q[t]=j\q[t-l]). 

j 

(2.17) 

Sincey[t] lq[t-l] | q[t], 

/3 t _ 1 [i] = ^P(y[t + l:r-l]|q[t]=j)P(y[f]|q[t] = i)P(q[f] = ilq[t-l] = 0 (2.18) 

i 

= Y l Pt\jW\Aj (2.19) 

j 

The backward recursion can be expressed compactly as 

15 1 = »4(^t+i ® fit+ 1)- 

2.5 Frequentist inference and its caveats 

Smoothing requires the values of the latent HMM parameters to be known; however, in most 
practical applications, only y is observed directly. In the usual frequentist approach, the state 
sequence q is inferred by first finding a maximum likelihood estimate of the parameters, 

(Ail,#ml,^ml) = arg max C{A,6,n\y), 

( A,d,n ) 

using the Baum-Welsh algorithm (Bilmes 1998; Rabiner 1989). This is only guaranteed to yield 
local optima, as the likelihood function is not convex. Repeated random reinitialization are used 
to find “good” local optima, but there are no guarantees for this method. Then, the most likely 
state sequence given those parameters, 

q = argmaxP(q I Al> 0 ml> ^ML,y), 

<2 
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is calculated using the Viterbi algorithm (Viterbi 1967). However, if there are only a few 
aberrations, i. e. there is imbalance between classes, the ML parameters tend to overfit the normal 
state which is likely to yield incorrect segmentation (Mahmud & Schliep 2011). Especially 
in CNV inference, where the rare, non-diploid states are more informative, this is problematic. 
Furthermore, alternative segmentations given those parameters are also ignored, as are the ones 
for alternative parameters. 

2.6 Bayesian inference 

The Bayesian approach does not assume ( A , 0, n) to be known. Instead, a joint prior distribution 

V{q,A,0,n |t) 

parametrized by t = (t^, t g, t„) captures our subjective believe about the values of the latent 
variables (q,A,9, 7t); notice that the distribution of q is solely defined by ( A , n, 9), see Fig. 2.2, 
and hence does not require a hyperparameter. Upon observing data y, the joint posterior is 

V{q,A, d,n\y,T) oc £(q,A,G,n\y)V(q,A,0,n\T). 

The distribution of state sequences is computed directly by integrating out the emission and 
transition variables, 

V(q,A, 0, n |y, t) dnd6 dA. (2.20) 

Since this integral is intractable, it has to be approximated using Markov Chain Monte Carlo 
techniques, i. e. drawing N samples, 

(q «, A®, 0 (i) , rc (0 ) ~V(q,A,6,n\y,x), (2.21) 

and subsequently approximating marginal state probabilities by their frequency in the sample 

1 N 

P(q[t] =s |y,T)^ — ^I(q W [t] =s). (2.22) 

i =1 

Thus, for each position t, we get a complete probability distribution over the possible states, 
solving the smoothing problem in a Bayesian setting. As before, this yields an MPM estimate 
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which can be used for segmentation. Notice that MCMC does not allow for the Viterbi path to be 
computed. 

The method of choice used in this thesis combines Gibbs sampling of the HMM with forward- 
backward sampling of the state sequence, yielding an MCMC inference algorithm known as 
Forward-Backward Gibbs sampling (FBG). 

2.6.1 Gibbs sampling 

As the marginals of each variable are explicitly defined by conditioning on the other variables, an 
HMM lends itself to Gibbs sampling, i. e. repeatedly sampling from the marginals (A \ q, 6,y, n), 
(0 | q,A,y, n), ( n \ A, 0,y,q), and ( q \ A, 0,y, n), conditioned on the previously sampled values. 
Using Bayes’ formula and conditional independence relations in Eq. (2.5)-Eq. (2.9), the sampling 
process can be written as 


A~T(A\n,q,T A ) 

oc C{A\n,q)V{A\r A ), 

(2.23) 

6 ~P(0 |q,y,T fl ) 

oc £(0 \ q,y)'P(0 | tq), 

(2.24) 

n ~V(n\A,q,T n ) 

oc C(n \A,q)V(n \ t„), and 

(2.25) 

q ~P(q \A,y,6,n) 


(2.26) 


where t x represents hyperparameters to the prior distribution V(x \ t x ). Typically, each prior 
will be conjugate, which yields 


A~V{A\'t A {n,q)), 

(2.27) 

6 ~V{6 |T e (q,y)), 

(2.28) 

Tr~P(;r|T 7I (M,q)), and 

(2.29) 

q ~P(q \A,y,9,n ). 

(2.30) 


Notice that the state sequence does not depend on any prior. The sampling of parameters is 
straightforward using their conjugate priors. 

2.6.2 Forward-Backward sampling 

There are several schemes available to sample q |y. A direct Gibbs sampling approach would use 


'P(q[t]\q[->t],y) = V(q[t]\q[t-l],q[t+l],y[t]) 
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to sample a state based on its neighbors ([-it] means all positions except t). This, however, yields 
high autocorrelation and slow mixing properties; instead, Scott (2002) has argued strongly 
in favor of forward-backward sampling (Chib 1996), also known as forward filtering, backward 
sampling (Murphy 2012). Variations of this approach have been implemented for segmentation 
of aCGH data before (Mahmud & Schliep 2011; Shah, Xuan, et al. 2006). Consider the 
following chain rule factorization, 

T—1 

r(q\y) = nq[T-l]\y)Y\r(q[t]\q[t+l:T-l],y) 

t= l 

T—1 

= V(q[T-l]\y)Y\r(q[t]\q[t + l],y) 

t=i 

T -1 

= V(q[T-l]\y)Y]r(q[t]\q[t+l],y[0:t]), 

t= l 

where the first step follows from the Markov property of q, and the second step follows from the 
conditional independence relation 


q[0:t]±y[t+l:T-l]|q[t+l]. 


By Bayes’ formula, 

'P{q[t]\q[t+l],y[0:t]) oc q[t] P(q[t+1] |q[t],y[0:t])P(q[t] |y[0:t]), 


and, since 


q[t+l]ly[0:t] |q[t], 


this becomes 


V{q[t] |q[t+l],y[0:t]) oc q[t] T(q[t+1] \q[t])T(q[t] |y[0:t]). 

Hence, we can recursively sample q | y in a backward fashion, using a trellis of forward variables 
and the backward-sampling recursion 

'P(q[t] = i\q[t+l]=j,y[0:t]) oc i A iJ a t [i], 

with subsequent normalization. It can be computed within the trellis by updating the columns as 


r[i][t]^r[i][t]A, q[ t+i] 
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Figure 2.3: 

A small trellis f 

or a simple HMM for CNV detection with t 

hree states {—, = 

, +} and T = 8, 



yielding a 3 x 8 matrix. Entries are represented by circles. After an iteration of forward filtering, it 
contains the forward variables for each position and state. Relations used in the forward recursion are 
indicated by arrows; here, each forward variable depends on all previous forward variables through the 
transition matrix A. After applying the backward algorithm, it contains the marginal state probabilities. 
If instead the backward sampling algorithm is used, it contains the sampling weight to obtain a state 
sequence variate q, shown here by filled circles. 


and sampling q[t] using the resulting weights, yielding a time complexity of 0[k 2 T ). The 
algorithm is illustrated in Fig. 2.3. The combination of this algorithm with the Gibbs sampling of 
the entire HMM is called Forward-Backward Gibbs sampling (FBG). 


2.7 Compressed Hidden Markov Models 

Forward-backward sampling of the state sequence, despite its various advantages over direct 
Gibbs (Scott 2002), is still expensive for genome-sized data. Firstly, notice that the forward 
variables need to be available for all states and data positions during Gibbs sampling; though 
this memoization in a dynamic programming table, called a trellis in the context of HMM, avoids 
exponential running times for the recursions shown above, it can grow to a considerable size. 
Secondly, since in each iteration a number of terms quadratic in the number of latent states has 
to be calculated at each position to obtain the forward variables, and a state has to be sampled at 
each position in the backward step, running times become impractical, especially due to billions 
of calls to a random number generator. 

To alleviate these problems, Mahmud & Schliep (2011) have recently introduced compressed 
FBG for Gaussian emissions by sampling over a shorter sequence of sufficient statistics of data 
segments which are likely to come from the same underlying state, thereby decreasing the 
size of the trellis by a constant compression factor, at the cost of approximate sampling. Let 
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Figure 2.4: A compressed version of the trellis in Fig. 2.3 into 5 blocks. Notice that the number of 
recursive relations (arrows) is reduced, as all emissions within each block are assumed to have been 
generated by the same state. Notice that the sampled state sequence cannot change between states 
within a block. 


B := (B W )J^ =1 be a partition of y into W blocks. Each block B w contains n w elements. Let y[w][k] 
the k-th element in B w . The forward variable a w (j) for this block needs to take into account the 
n w emissions, the transitions into state j, and the n w — 1 self-transitions, which yields 




"■w 

( J ) B w)X Snd 


i =1 


H-w 

£(/r,cr 2 |B W ) = f” [p/Acr 2 |y[w][k]). 
k = l 

This compression is illustrated in Fig. 2.4. The ideal block structure would correspond to the actual, 
unknown segmentation of the data. Any subdivision thereof would decrease the compression 
ratio, and thus the speedup, but still allow for recovery of the true breakpoints. In addition, such 
a segmentation would yield sufficient statistics for the likelihood computation that corresponds 
to the true parameters of the state generating a segment. Since the block structure is the target of 
the inference, the authors use a heuristic motivated by kd -trees to create a static block structure. 
The authors showed that the approximation error is small under reasonable state separation 
assumptions. In such a setting, the emitting state likelihood dominates the forward variables, 
whereas that of other states are close to zero, leading to negligible alternative state paths. This 
allows for an approximation of marginal probabilities, and MPM segmentation. 
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Chapter 3 

Wavelet Transform 


Our method uses the Haar wavelet transform to dynamically denoise and compress the input 
data, allowing the Gibbs sampler to operate on different resolution levels. In this chapter, we 
briefly review wavelet theory, closely following the exposition of Mallat (2009). 


3.1 Multiresolution analysis and wavelets 

Let (n,A, /i) be a measure space, where Q is the base set, A is a a -algebra over Q, and /i is a 
measure on A. Let 


L p (Q,A,/i) :=|/ :0 —> K,KG {M,C},/measurable, J |/(x)| p d/i(x) < oo 
be the space of p-integrable functions. Let 


L 2 (M) := L 2 (M,£(M), A) 


be the space of square-integrable functions over the reals with Borel algebra B(R ) and the 
Lebesgue measure A, and likewise for subintervals of M. This is a Hilbert space with inner product 

( f,g)-= f /(x)g(x)dA(x) = f /(x)g(x)dx. 

Jr J—oo 

We are only concerned with functions over subsets of M, so the inner product commutes without 
involving the complex conjugate. The inner product induces the norm 


11/II := yfUJ)- 
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Two functions /, g are said to be orthogonal iff (/, g) = 0, and a function / is called normal iff 
||/1| = 1. In signal processing terms, square-integrability 


\fM\ 2 dx = (/,/) < oo 


means that / has finite energy. L 2 (R) is a separable Hilbert space, i. e. it can be spanned by 
an orthonormal basis, a set of basis functions v, and a set of coefficients a ; such that, for each 


/ e L 2 (R), 

+oo 

/(t)= ^ cqv/t) with a i = (f,v i ), 

i=—OQ 

Vi / j : (v i; v } ) = 0 (orthogonality), and 


Vi : ||v,-1| = 1 (normality). 


The goal of multiresolution analysis (MRA) (Mallat 1989; Meyer & Salinger 1992) is to 
provide a sequence of well-behaved, nested subspaces, which allow to approximate a function 
/ e L 2 (R) at different levels of resolution. The differences between those resolution levels will 
be captured by wavelet functions. An example for the Haar wavelet is shown in Fig. 3.1. Consider 
an approximation space Yj c L 2 (R), with resolution 2 -; , and its inverse 2 ] , called scale. For any 
function / € L 2 (R), its projection P Y .f onto Yj provides a lower-resolution approximation. Let 
Yj be translation-invariant for integer multiples of scale 2 J , i. e. 

/(t)eVj«/(t-2 , ()eV j , j,k& Z. 


Further, let a family of such spaces Yj be nested such that V j+ i c Yj, so that lim^oo Yj = {0} 
and lim^-oo Y } = L 2 (R). This nesting is done in such a way that the spaces are dilated versions 
of each other. More precisely, 

/(OsV^/QjeV^. 


Each Yj is spanned by an orthonormal set of scaling functions 4>jy, which are obtained by 
translation and dilation of a certain function cp e L 2 (R) by 


4> 


j,k •= 


V27 




t-k 
2 i 


, j,k e Z. 


Hence, due to orthonormality, the approximating projection of / is obtained as 


+oo 


P Vjf = J] (•■ 


k=—o o 
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The nesting of function spaces Vj c V ; .| means that certain approximation details are lost 
between resolution levels, since any function in Y j can be expressed as a member of V ( _ ], but 
not the other way round. To capture those details, let W ; - be the orthogonal complement of Vj 
with respect to Vj-i, i. e. 

Vj_i = Y j ® W j, 


which implies 

v/=V +i V- 


This allows for the decomposition 

i 

L 2 (R)=V ; ® 0 W j 

j=—OQ 

for any i e Z, as well as a decomposition free of scaling functions, 

OO 

L 2 (R)= 0 Wj. 

j=—OQ 

The detail spaces W ; - are spanned by an orthonormal basis of wavelet functions xp jk , which 
are derived from a mother wavelet xp € L 2 (R) by translation and dilations, 


i’i.k 




t — 2 J k 


2J 


j ' k '~ V 2 J 

For more details on the properties of and the derivation of xp through conjugate mirror filters, 
see Mallat (2009). 

Equipped with those definitions, any function / e L 2 (R) can be expressed as a linear 
combination of scaling functions and wavelets, 


00 i 00 

/(t)= 5 ] C i,k 4 ’i,k(t)+ 5 ] 2 d J,fcV'j,fcCO- 

k=— 00 j——o o k=—oo 

Here, c ; are called scale coefficients, and dj k are called detail coefficients or wavelet coefficients. 
They can be expressed as inner products 


Cj,k = (<t>j,k,f), and 
dj,k = {ipj,k,f)- 


The set of those coefficients is called the discrete wavelet transform (DWT). 
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bullets, the lines represent the real-valued functions 4’j, k - Each wavelet function has a central discontinuity b± k , which is connected here by a vertical line, as well 
as a left and right discontinuity b+ k and bj k . For instance, 6+ 1 = 4, bt; 1 = 6, and b~ 1 = 8. Notice the nested, tree-like structure of this basis. Right: Projections of 
/ = sin (| t) (top) onto different approximation spaces V-. For each subplot on the right, the basis elements are those in the subplots at and above that level on 
the left, e. g. the basis for V 2 comprises W 3 , W 4 , and V 4 . Notice that for integer coordinates P v f = f, i. e. the discrete wavelet transform (DWT) is lossless for 
equidistant samples / of /. 
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In computational applications, a physical signal / would often require the measurement, 
storage and processing of infinitely many values, unless the signal is discrete in nature. Instead, 
a vector / is obtained by sampling / at equidistant intervals. Let t 2 := L 2 (N, VfN), #) be the L 2 
space over the natural numbers with counting measure #. It is also a separable Hilbert space, 
and can be treated as the discrete analogue of L 2 , where 

(f,g)= f (/[t],g[t])d#(t) = 2 ]/[t]g[t]. 

Jn teN 

Notice that we prefer this notation to / ■ g to emphasize its connection to the function space. 
The discrete versions of the wavelets are then obtained as '■= V’/,fc(0 for teN. 

Further, assume finitely many sampling points such that / e M T . For the moment, assume 
T is a power of 2; data with T 2 n can be treated as truncation of longer data, which can be 
generated using various padding methods (Strang & Nguyen 1997). The wavelet basis is then 
finite, and forms the rows of a matrix TV e M rxr , such that Wy yields the vector of wavelet 
coefficients. This is called the discrete-time wavelet transform (DTWT), though DWT is often used, 
as any software implementation is by necessity discrete. Since the wavelet basis is orthonormal, 
W is orthogonal, i. e. W _1 = W T . Notice that the order of rows in W is unspecified, and hence 
different permutations of the coefficient vector are possible. Surprisingly, there are linear-time 
algorithms to calculate Wy (Sweldens 1995; Mallat 1989). 

3.2 Wavelet regression 

One of the main applications of the discrete-time wavelet transform is in functional regression, 
specifically in a method known as wavelet thresholding or shrinkage. Let the observed signal be a 
sampling of a function / corrupted by centered Gaussian noise, i. e. 

y = / + e, e[t] ~ u . d . N(0, cr 2 ). 

The goal of regression is to find an estimate / which approximates / using the available data 
y. The quality of regression is usually discussed in the framework of decision theory. Before we 
discuss wavelet regression, we review some central concepts and results. The exposition follows 
Robert (2007) and Mallat (2009), with terminology following the former. An overview of 
relevant concepts is shown in Fig. 3.2. 
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Figure 3.2: An overview of decision theory for regression. Concepts aligned with a frequentist 
framework are shown in red, and Bayesian in blue. 


3.2.1 Decision theory for regression operators 

Let F be a set of functions such that / eF, and F be a distribution over IF. Let 5 be a decision 
operator, i. e. a regression method for estimating / := <5(y). Let L(5(y),/) be a non-negative 
loss function which quantifies the error for estimating / by /. The frequentist risk, or risk for 
short, quantifies the expected loss 

i?(S,/):=E y [L(S(y),/)|/]= f L(5,f)V(y\f)dy (3.1) 

over all data instances for a given /. 

In order to summarize the risk across all instances of /, there are two principle ways. If its 
prior distribution F is known, the integrated risk is obtained as the expectation with respect to F 
as 

R(5,F):=E f [R(5,f)] = f R(5,/)W)d/. (3.2) 

The operator 5 j- achieving the Bayes risk 1 

R( F) := R(5j r, F) = infR(<5, F) (3.3) 

_ 5 

1 Naming conventions vary among authors: Mallat (2009) calls the posterior risk the Bayes risk, and that achieved 
by 5jr is called the minimum Bayes risk. 
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is called a Bayes estimator. On the other hand, if such a prior is not known, the maximum risk 

R(5 ,¥) := sup R(5,f). 

f SF 

is the largest risk incurred by any / in F. An operator 5 F achieving the minimax risk 

R(¥) := R(5 F ,F) = infR(5,F), 

5 

is called a minimax estimator. It is easy to show that the minimax risk is an upper bound to the 
Bayes risk, i. e. 

Ken < mi 

since including prior information on the distribution of / can only improve the estimation. By 
the minimax theorem (Wald 1945; Komiya 1988; Sion 1958) minimax estimations are Bayes 
estimations for a least favorable prior, i. e. a prior distribution T a achieving the minimax risk, 

R(F) = R(.F 0 ). 

Though being a very conservative criterion, minimaxity tries to capture the notion of an optimal 
estimator. For instance, the sample mean is minimax for estimating the mean of i.i.d. Gaussian 
random variables, and maximum likelihood estimates for parametric settings are asymptotically 
locally minimax (see Donoho, Johnstone, et al. (1995) for a detailed discussion). 

The question remains as to how a Bayes estimator can be derived. Notice that the above 
definitions are based on a frequentist interpretation of risk, as it is defined for a known / and 
integrated over all possible variates y, even though / is latent and only one realization of y is 
observed. It is therefore more natural to take the Bayesian approach: the posterior expected loss 
or posterior risk 

p(5,/|y):=E / [L(5,/)|y]= f L{5,f)V{f |y)d/ (3.4) 

quantifies the expected loss for the given data y across all true functions / ~ T with respect 
to a prior distribution T. The Bayes estimator can then be obtained by selecting the optimal 
estimator for each y individually as 


<5^(y) '■= arginfp(<5,/ |y), 

O 
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since 


R(5,F) 




p(5,f |y)P(y)dy. 


This is of particular interest if the loss function is quadratic, i. e. 


L(5(y),/)=||5(y)-/|| 2 . 


(3.5) 


In this case, a Bayes estimator for each y corresponds to the mean of the posterior distribution, 
the posterior expectation 

^) = -[/W = J/WlrW. 

Expectation holds component-wise, 


Vt: f[t] = E[/[t]|y], 


(3.7) 


see Mallat (2009, Theorem 11.1). This fact will be used later in this thesis to connect Hidden 
Markov Models to wavelet regression. Other loss functions yield different statistics of the 
posterior distribution. For instance, under certain conditions (Bassett & Deride 2018), the 
Bayes estimator for the 0-1 loss 




0 11/ — /II < e 


/-/ 


> e 


e > 0, converges to the posterior mode, more commonly called the maximum a posteriori (MAP) 
estimate, for e —> 0. 


3.2.2 Minimaxity of wavelet thresholding 

In regression settings, it is often impossible in practice to formulate a prior signal distribution T, 
and the Bayes estimator (Eq. (3.7)) is generally unattainable. For instance, no practical statistical 
model exists to accurately describe photographic images. Therefore, other criteria, such as 
minimaxity with respect to a signal class IF, are usually employed to select a regression method 
(i admissibility being another popular criterion). Unfortunately, deriving minimax estimators is 
often hard, and they tend to be hard to compute. However, good approximations can sometimes 
be obtained. Wavelets in particular allow for a very simple approximation of minimaxity. 
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Consider a diagonal estimator 5 in which each detail coefficient dj k is attenuated by a factor 
a ; - fc, which depends on dj k itself. The risk is then 


R(5,f) = E [||/ - 5(y)|| 2 ] = 2X| {^j,k,f)-aj, k ^)[ 2 ] 

j,k 


Obviously, this attenuation requires knowledge of /, and is therefore not attainable, unless an 
oracle for / is provided. For theoretical purposes, consider the existence of an oracle having 

f 

knowledge of/. The oracle attenuator 5^ tt minimizes the risk by setting 

IWw)l 2 

Hk Ife,/)f+- 2 ' 

f 

Restricting the attenuation coefficients to {0,1} yields the oracle projector o ])r for which the best 
risk is obtained by 


1 


Hk = 1 




(3.8) 


[0 \(^j,k,f)\<o- 

In Donoho & Johnstone (1994), this method is referred to as selective wavelet reconstruction. It 
can be shown that the oracle projection risk comes close to the optimal attenuation, since 


li^/) <*(«£,,/) <*(«',/), 

and hence 

R(s£ r ,/)<2R(s{ tt ,/). 

Though the oracle is inaccessible, oracle estimators can be approximated surprisingly well. 
Let the universal threshold be 

X u := V21nTcr. (3.9) 


The hard-thresholding estimator <5 th is a projector using 

r 


1 


Hk = 1 


o 


\{^j, k ,y)\>H 
\{^j,k,y)\ <H 


which achieves a risk of 


(3.10) 


R(5 th , /) < (2 In T + 1) (cr 2 + R ($£,/)) 


(3.11) 
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for T > 4; see Mallat (2009, Theorem 11.7) for details. 5 th is optimal in the sense that the 
risk cannot be improved by any other diagonal estimator. It is also asymptotically minimax over 
a wide range of smoothness classes (Donoho & Johnstone 1995). Most importantly for this 
thesis, it is asymptotically minimax over piecewise a-Lipschitz functions, including the case where 
F is the set of piecewise-constant functions. 

Using universal thresholding for wavelet regression, also called wavelet shrinkage, has a very 
intuitive explanation: Following Donoho & Johnstone (1994), a wavelet is said to have m 
vanishing moments if 

= 0, 0 < i < m,p scalar. 

It follows that ip is orthogonal to any polynomial / of degree less than m, since 

/m—1 \ m—1 

(Sp'»f)=E<uU=o- 

\ i =i / ;=i 

This property is called polynomial suppression (Mallat 2009) . It follows that the detail coefficients 
in any wavelet decomposition of such a polynomial / are all zero, as it can be expressed entirely 
in terms of the scaling functions. Furthermore, any function in C m is well approximated by 
a (m — l)-degree Taylor polynomial p v about point v over a finite interval [v —h,v + h]. Let 
/ =P V + e v . Then 

(fAj,k) = (e v Ai,k), 

so the wavelet coefficient is negligible, as it only measures the Taylor approximation error, which 
is bounded by the Lipschitz coefficient of / as 

|e v (t)| < K\t — v\ a , K > 0, and m = [ctj < n. 

Since the wavelet transform is linear, it acts on the signal and noise component independently, 
i. e. 

Wy = W(f + e) = Wf + We. 

The central idea in wavelet shrinkage is that the detail coefficients dj k = (/, ipjy) are 0 if / is 
polynomial over the entire support of ipj due to polynomial suppression, or vanishingly small 
if / is locally m-times differentiable. Furthermore, due to orthogonality of W, We is again a 
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random vector of i.i.d. random variables distributed as 1V(0, cr 2 ), so the noise is maintained under 
the wavelet transform. In general, orthogonal maps preserve the L 2 norm, so 

|| We|| = || e || and ||Wy|| = ||y||. 

It follows that for piecewise polynomial functions with only a few discontinuities, most signal 
coefficients (/,'*/’ j,k) are zero due to polynomial suppression, hence 

for most j,k, i. e. most wavelet coefficients arise entirely from noise. As the noise component of 
the data is preserved as a Gaussian vector of variance cr 2 under the orthogonal transformation 
W, most of the T detail coefficients are themselves i.i.d. Gaussian random variables. The idea is 
then to find a way to create a vector w by setting a suitable set of coefficients in Wf to zero, 
and then use the inverse wavelet transform as a regression / := W T w. Ideally, these would be 
those exactly those noise coefficients. As a filtering criterion, the universal threshold can thus be 
interpreted as the expected maximum deviation of T such Gaussian random variables, which is 
at most X u := V 2 In Ta by Cramer-Chernoff’s method (Massart 2003). A simple derivation of 
this result has appeared multiple times in the mathematical vernacular: 

Proposition 3.2.1 (Expected maximum of Gaussian random variables). Let {Xi)J a sequence of 
T centered i.i.d Gaussian random variables with variance cr 2 . Let Z := maXjAj Then, 

E[Z] < V2In T<j 


Proof. Using Jensen’s inequality, 

T 

exp (tE [Z]) < E [exp (tZ)] = E j^exp ^t maxX ; j j = E J^maxexp (tXj)j < ^E [exp (tX ; )]. 

i=l 

The last term can be expressed a product of Gaussian moment-generating functions 

, cr 2 


with ;tq = 0, hence 


exp| tfi+t — 


exp (tE [Z]) < T exp t 


, cr 
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and 


The minimum is attained at 



□ 


The same upper bound can be shown for Z := max, : |X ; |, though this expectation will generally 
be larger. Furthermore, the maximum noise coefficient is just below X u with high probability 
(Berman 1992): 

lim rf max \(il> ik ,e)\ € A„ — a ^ —— , A„ 1 = 1. 
oo Vo 71 L lnl JJ 


3.3 The Haar wavelet 


The HMM treated in this thesis generate piecewise constant data with noise, albeit with potentially 
different noise variances for each state. As the underlying sequence of means is piecewise linear, 
we can base our method on the simplest, piecewise constant form of wavelets. This decomposition 
goes back over a century to the work of Haar (1910), and is widely recognized as the first example 
of a wavelet transform, long before the broader theory was established. Let the Haar scaling 
function be 

r 

1 0 < t < 1 

0(0 : = < 

0 elsewhere. 

The Haar wavelet constructed from that scaling function is 

1 0 < t < \ 

0(0 := - -1 \ < t < 1 

0 elsewhere, 
v 

As before, the basis elements are defined as 



where has non-zero support over the interval [2 J k, 2 J (k + 1)). The Haar wavelet has one 
vanishing moment, so in general it is orthogonal only to constant functions. 
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The discrete-time Haar wavelet transform Wy for a vector y of size T consists of the scaling 
coefficient c 0>0 as well as the detail coefficients d )jJt , j G {1,..., Id T}, k G {o,..., |j — l}. The 
rows of W consist of all ipjj, as well as (j ). Let 

b+ fc :=2 ; fc, b± k := 2 1 , and bj k :=2 J (k + l) 

be the position after the left, central and right discontinuity, respectively. Let 

ht ■- {f’L.^ -1} Rj k := .... br k -1} 

denote the index sets of the left (positive) and right (negative) support interval of j., i. e. > 

0 for t G Lj k and 1 < 0 for t G Rj k . Note that is zero outside of these index sets, so 
we ignore those entries and only refer to L UK as the support of the wavelet. The wavelet basis is 
illustrated in Fig. 3.1. 

W.l.o.g., let k = 0, ipj := ip ^ 0 , L ; := L ] (j , and Rj := K j 0 . Recall that for X, ~ lV(/tj, af) and 
s t € {—1,1}, the term 

~ N J] af cr 2 j, 

hence 

( e > ^j )=^ fS ] - S1~ N (°> ct2 ) • 

yteij teRj J 

As expected, the noise is preserved under the Haar wavelet transform. 
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Chapter 4 

Haar Wavelet Compression 


In this chapter, we introduce the core methodological contribution of this thesis, using the 
established theory introduced in the previous chapters. We show that the expectation of posterior 
marginal state means in homoscedastic Gaussian Hidden Markov Model (HMM) is a Bayes 
estimator under prior parameters z, and its risk is limited by the minimax risk of Haar wavelet 
regression. Using the positions of discontinuities as boundaries, blocks created by Haar wavelet 
compression preserve the approximate positions of true state transitions, and the probability 
of under-compression decreases with the separation of state means. Furthermore, we derive 
a dynamic compression scheme for heteroscedastic HMM, and implement it using a wavelet 
tree data structure. We provide an evaluation of our method on simulated and biomedical 
data, showing significant improvements in speed and convergence behavior over uncompressed 
Forward-Backward Gibbs sampling (FBG). Finally, we derive bounds on the bias incurred on 
the forward variables by compression for general HMM, and show that this tends to bias the 
forward-backward sampling procedure towards the maximum posterior marginal state within 
each compression block. We conjecture that our software HaMMLET produces an approximation 
of the maximum posterior margins (MPM) segmentation (see Section 2.4). 

The central approach of this thesis is to integrate Haar wavelet regression with Forward- 
Backward Gibbs sampling, in order to derive approximate state marginals and Bayesian smoothing 
for Hidden Markov models. Significant speedup and memory savings can be obtained by data 
compression in which the discontinuities in the piecewise constant function /, obtained from 
Haar wavelet regression towards the sequence / of emission means of the true state sequence q, 
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are used to define boundaries for data compression blocks: 

Definition 4.0.1 (Haar wavelet compression). Let y be a vector of input values, and d some vector 
ofHaar wavelet coefficients of the same dimensionality (typically the result of some regression method 
on Wy, such as wavelet shrinkage). Let y be partitioned into blocks of sufficient statistics such that 
a block starts at position t if and only if there exists some j, k such that d kk 7 ^ 0 and at least one of 
bj k , b± k or b~ k is equal to t (see p. 43). In other words, the data is compressed into blocks defined 
by the discontinuities in W T d. 

Notice that only projection operators, i. e. methods which set coefficients to zero, yield 
compression into blocks of at least two position, since any non-zero coefficient d l} will introduce 
a discontinuity, and hence a compression block boundary bf :, and in many cases at is left and 
right discontinuities bt and b~f. as well. Therefore, only projectors should be considered in a 
compression setting, and among them, thresholding methods achieve the best risk, with the 
universal threshold approaching the minimax risk. 

4.1 Homoscedastic HMM 

This proposal is based on the observation that HMM inference and regression methods deal with 
similar input data. Consider our observed data y to consist of a piecewise-constant data vector of 
means, /, corrupted by additive noise e, which is typically Gaussian. In a regression setting, the 
noise is considered to be homoscedastic, i. e. of uniform variance, while the number of distinct 
values in / is not limited a priori. On the other hand, for data generated by a Gaussian HMM, 
the number of distinct means is limited by the number of states, while there may be different 
noise variances associated with each component. The obvious intersection case is data generated 
by an HMM which has the same finite emission variance in each component, which yields a 
homoscedastic random vector y: 

Definition 4.1.1 (Homoscedastic HMM (cr-HMM)). We call a Gaussian Hidden Markov model a 
homoscedastic HMM of variance a 2 , or cr-HMM, iff 0 ; = (/r ; , cr 2 ). State labels can be identified 
with the emission means, so that the state sequence q e { 1 , 2 ,... } T can be written as the discrete 
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sampling 

f '■= (Mq[l]) • • • »Mq[r]) e • • • } T 

of a piecewise constant function / € F such that 

Vte{l,...,r} :/[t] = /(/), and 
Vi€[t,t + l):/(x):=/[t]. 

A ( 7-HMM thus defines a probability distribution over F. 

Since we can easily convert between / and /, and minimaxity results for wavelet regression 
apply to discrete samplings /of/, we forgo the distinction between M 7 ' and F, and we refer 
to the vector / as a piecewise-constant function. Hence, on the one hand, data generated by a 
cr-HMM can be treated as a piecewise constant function with i.i.d. Gaussian noise, where / can 
be approximated by regression. On the other hand, / is a hidden Markov chain which has an 
explicit distribution given the HMM parameters, allowing for an explicit formulation of Bayes risk. 
Thus, the observed data y can be treated both within a regression as well as an HMM framework, 
allowing for an integration of wavelets and FBG. 

As / is piecewise constant, the Haar wavelet with its one vanishing moment is the obvious 

candidate for regression. The emission variance cr 2 can be estimated directly using the standard 

approach of taking the median absolute value of finest detail coefficients |d lfc | , see Mallat 

(2009, p. 565) for details. Having the smallest support of 2, very few wavelets will contain 

a discontinuity in / in their support. Since d l k ~ lV(0,cr 2 ), this corresponds to the median 

absolute deviation from /i = 0, and hence 

medi, Idi J V r . ,\2 

4,-1 (|) J =(0.6745med|d 1 , i |) , 

where 4> is the PDF of the standard normal distribution IV(0,1). Using the universal threshold 
v 7 2 In T cr MAU (Donoho & Johnstone 1994), with high probability, the resulting estimate / th 
will set to zero all coefficients of wavelets with support over ranges that do not contain state 
transitions, and hence have no discontinuity in /. Conversely, it will keep the coefficients of 
those wavelets which have state transitions within their support. This creates discontinuities in 
the regression result around state transitions. We will develop the connection to HMM in the 



sections below. 
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4.2 Connection between HMM and wavelets 

In this section, we relate wavelet shrinkage to HMM smoothing. Since cr-HMM define distributions 
over F, we define two Bayes estimators based on the posterior state marginals. Under quadratic 
loss, these are obtained by position-wise posterior means, conditioned on the observed data y. 
We argue that changes in one of these estimators reflect strong changes in the posterior state 
distribution. Since minimax estimators are Bayes estimators for least favorable priors, Haar 
wavelet regression appears as a limiting case, and block boundaries can be expected to occur 
around changes in the marginal state distribution. We also discuss how wavelet compression 
might bias the Gibbs sampler towards the posterior parameters, yielding faster convergence. 

Definition 4.2.1 (Smoothing estimator). Let H be a cr-HMM with the hidden Markov chain 

M(f):=V{f\A,9,n ) 


over the state space {/ij. Let 


Tj(0 '■='P(q[t] = j\y,0,A, n) 

he the marginal probability to be in state j at position t, as obtained by HMM smoothing via the 
forward-backward algorithm (Section 2.4.2). Then we call the estimator f M := 5 M (y) with 

?x[f] 

i 

the smoothing estimator for f with respect to H. 

An example for this estimator is shown in Fig. 4.1. Notice how f M provides a summary of 
the behavior of the marginal state distributions, and can therefore be used as an indicator of the 
loci at which the marginal state probabilities change: 

Proposition 4.2.1. Given data y, the change in the smoothing estimator between positions t and u 
is 

k -1 

f M \- f ] -7m^ u ~\ = X Oofr) - io-oo) ~ 

j =i 

and that difference between any two positions scales linearly with the separation of means. 



Marginal state probability 
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Figure 4.1: Smoothing estimator / M {y) (top figure, red curve) for a state sequence / (black curve), 
and emission values y (gray dots) generated by a 3-state ct-HMM with known parameters. Wavelet 
thresholding / th (green curve) is the Bayes estimator that minimizes the risk incurred under the least 
favorable parametrization of a cr-HMM. The bottom figure shows the true marginal state distributions 
at each position. White vertical lines indicate the position of discontinuities in / th and hence the 
compression block boundaries. 


Proof. Let dj = Yj(u) — y -(t) be the change in marginal state probability for state j, and jj : = 
r,(0- Since ^ =1 7j(0 = Ej=i 7,0) = 1, we have 

-/mM = 

Em + (' ■-Enjf*,} - (Ste + (i - (Jr,) - g«;))*) 

fc—1 \ fk- 1 \ 7 k-1 \ 7 k-1 \ fk- 1 \ 7 fc-1 

2 7jMjJ + Pk ~ I J] YjPk J ~ I J] TjPj J ~ IJ] djUj J -p k + IX YjPkJ + IX 

k-i A 7 k-i \ k-i 

X d i^ MX d Oi = X d i 0** - Mj) • 

i=i y Vi=i y i=i 
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Further, w.l.o.g. let /ij < /r 2 < • • • < Mfc- Then p k — Pj = {l^k ~ Hj \ for k > j, and for any scaling 
factor a e R. of mean separation, 

fc-i k- 1 

^dja\ix k -nj\ = a^dj\n k -Hj\ = a(f M [t]-f M [u]). 
j=i j =i 

□ 

Notice that shifts in / v , are small if means are not well separated, changes in marginal state 
probabilities are small, or in rare cases, the vector of changes dj is orthogonal to that of means 
in,. Conversely, large mean separation or large state probability changes increase shifts in f M . 

Having obtained a regression operator which captures significant changes in the marginal 
state distribution, we now relate it to wavelet regression using the framework of Bayesian decision 
theory: 

Proposition 4.2.2. 5 M is a Bayes estimator for f with respect to prior V (/ | A,n,0) under 
quadratic loss \\f - 5 M (y )|| 2 . 

Proof. For a quadratic loss function, any Bayes estimator equals the mean of the posterior 
distribution given y (Eq. (3.7)). Conditioning the hidden Markov chain M(f) on the observed 
data y yields the posterior 

M(f\y) = V(f\A,6,n,y), 

i. e. the state sequence distribution in the HMM for which M is the latent state process. Then, 
/atIT] = E[/[t] | A,B,n,y] = J f[t]T(f \A, 6, n,y)df[t], 

and therefore 

j i 

attains the Bayes risk R(M). □ 

While a cr-HMM plays the role of a prior distribution on F in the decision-theoretic sense, we 
can extend the argument to Bayesian cr-HMM with a true prior for the model: 

Proposition 4.2.3 (HMM estimator). Let 5 T (y) with 


/ T [t] : =E[/[t] \y,z] 
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be called HMM estimator for f. This is a Bayes estimator for a prior distribution J 7 = V{f \ y,T) 
over the set of piecewise constant functions F under quadratic loss. Its risk under a least favorable 
parametrization M 0 := M(f [ t d ) is bounded by the minimax risk over F, asymptotically attained 
by Haar wavelet regression, 

R(M 0 )<R( F). 

Proof This follows directly from the results in Section 3.2. Since the parameters to M them¬ 
selves are subject to priors of their own, it is obvious that V{f |y,T), analogous to V{q |y,T) 
(Eq. (2.20)), defines a prior distribution over F in the decision-theoretic sense, with all prior 
information about the HMM encoded in t. Under quadratic loss, the Bayes estimator for each y 
is obtained as 

J f[t]'P(f[t]\y,Tt)df[t] = E[f[t]\y,z]. 

□ 

Notice that this estimator is hard to compute for the same reasons that we use MCMC to sample 
the HMM posterior in the first place. It only serves to justify the use of Haar wavelet compression. 
However, in analogy to Eq. (2.22), it could be easily approximated using Forward-Backward 
Gibbs sampling, as 



Since both estimators / th and / T are bounded by the minimax risk, and wavelet thresholding 
typically yields a good approximation of/, we expect ||/ t h — /tII to be small. Consequently, we 
expect the discontinuities of the wavelet regression / th to roughly correspond to changes in the 
posterior state distribution of a Bayesian HMM. Hence, by virtue of being a minimax estimator of 
/, wavelet shrinkage introduces information about the true posterior state marginals into the 
sampling process before the posterior distribution is ever available to the sampler. We believe 
this makes for a strong case to use those discontinuities as block boundaries for compression. 

FBG samples from f | y,T) only once the underlying Markov chain has converged. While 
we argue that wavelet compression is compatible with a cr-HMM in the sense that changes in its 
value correspond to large changes in posterior state distribution, it is worthwhile to investigate 
whether a similar statement can be made about the burn-in phase using the smoothing estimator, 
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which, by virtue of being Bayes, is bounded by the minimax risk as well. Specifically, one might 
be inclined to conclude that we can expect ||/ th — ImW t0 t> e sma ll> i n analogy to ||/ th —/ T ||, but 
we will argue that this is not the case. 

In the usual decision theoretic settings, risk calculations are devices used to rank the quality 
of estimators under the different prior assumptions a statistician can make, and the minimax risk 
bounds the risk under the least-favorable prior knowledge she might have about the parameter 
in question. As such, any risk calculation will not account for any information not included in 
the prior. Specifically, the integrated risk (Eq. (3.2)) effectively only integrates over the domain 
of the posterior, either by discounting the frequentist risk (Eq. (3.1)) by a factor of 0 whenever 
V{f ) = 0, or, equivalently, by discounting the loss whenever V(f |y) = 0 in the calculation of 
the posterior expected loss (Eq. (3.4)). Whenever the true / is outside of the posterior domain, 
the loss it incurs is completely removed from consideration. Risk therefore describes the loss the 
statistician expects subjectively, as opposed to the loss she is expected to experience objectively for 
a true /. 

In the context of a Bayesian HMM, the risk of the HMM estimator truly corresponds to the 
prior expectations we have about the latent parameters, as encoded by our hyperparameters t. 
The HMM estimator summarizes the posterior distribution of the Bayesian HMM from which we 
sample after FBG has converged. The smoothing estimator, on the other hand, summarizes the 
posterior distribution 

V(q \y,A,6,n), 

from which we sample the state sequence in each FBG iteration, through the position-wise 
weighted average of state means. Its risk, too, is bounded by the minimax risk; however, this 
distribution does not encode subjective prior knowledge in the Bayesian sense: any state sequence 
for which at least one emission mean is not in 6 has probability zero in both the prior and the 
posterior. It is extremely unlikely for the true state means to be sampled precisely, especially 
during burn-in, hence the true loss incurred in sampling is not accounted for in most cases. For 
this reason, it would be a fallacy to expect that ||/ th — /xll is small, since the assumptions in a 
prior M are violated unless all its state means are those of/. Perhaps not surprisingly, the block 
boundaries in wavelet compression cannot be expected to follow the most likely state transitions 
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for arbitrary HMM parameters for M. While this could be seen as problematic, we would argue 
that in fact this might help with convergence behavior of FBG. Wavelet compression changes the 
distribution V{q \ y,A,0,n) by suppressing state distributions occurring far away from changes 
in the true posterior marginals of P(q | y, t) in the Bayesian HMM during the burn-in phase. 


4.3 Heteroscedastic HMM 

While we show that Haar wavelet regression can be used as an approximation for the expected 
posterior state means in a cr-HMM, the squared loss itself is of limited concern when regression 
is used for compression, as it is the preservation of discontinuities in /, induced by states with 
separated means, that prevents over-compression. This is a concern in particular in that wavelet 
shrinkage using the universal threshold has been found to underfit the data (Antoniadis & 
Oppenheim 1995; Fan et al. 1993). Furthermore, results concerning estimators are not easily 
extended to heteroscedastic y, as obtained by an HMM in which emission variances are not the 
same for all states. 


4.3.1 Preservation of discontinuities 


It is well-established that wavelet regression yields high-amplitude coefficients for wavelets 
spanning a discontinuity in /. For homoscedastic data, breakpoint preservation at coarser 
resolution levels follows directly from oracle projection, for which universal thresholding is 
minimax and no better diagonal estimator can be obtained: Assume that, at scale j, all emissions 
at t G Lj have emission mean /i L , and those in Rj have /i R . If we assume that (/, ipj,k) A a 
whenever / has a discontinuity at bt fc , oracle projection (Eq. (3.8)) would set 


*j,k 


1 A/i>^C7 

0 else, 


creating discontinuities at b± k in /. Means with sufficiently large separation A/i are therefore 
unproblematic. However, large noise variance cr 2 can cause an obvious problem. An oracle 
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concerned with breakpoint preservation, however, would set 

( 


1 


a j,k = i 


0 


Apt > 0 
else. 


Asymptotically, the two are equivalent, as 


lim —-—-cr = 0, 

H °° V2J 

meaning that even breakpoints between insufficiently well separated states are indeed preserved 
for sufficiently long state durations. Notice that any introduction of additional discontinuities 
around that position would only create additional approximate breakpoints, and the number of 
spurious discontinuities far away from state transitions is limited, since noise coefficients are 
removed with high probability. 

For the heteroscedastic Gaussian case with general state emission variances, the probability 
of missing a breakpoint can be quantified directly. The inner product of a Haar wavelet with the 
data is a sum of Gaussian random variables with means corresponding to the latent parameters 
pi q [t] for all t in the support interval. It thus follows directly from the additive properties of the 
Normal distribution that the coefficients are distributed as 


(y,^ ; )~iv(pi qj , o-qj, with 


H qJ --=(^j,q)=^= ( ’ and 

\teLj teRj J 

°lj-=^p e ) = h £ <d[tr 

teLjURj 

Note how the mean is determined by the signal coefficient alone, and that the variance, determined 
solely by the noise coefficients, is always the average noise variance across the support, regardless 
of j. 

Due to the symmetry of the normal distribution, let /i L > /i R and A/t := (pi L — /i R ) w.l.o.g., so 
that (t/jj, q ) = A/i is non-negative. It follows that 


(^;,e)~IV^0, 


ct l + cj k 


2 
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and hence 




a2 L + Cj2 R 


For any given threshold A, we can use this density to directly quantify the probability of under¬ 
compressing a state transition as 

( fV 27, 


’((y>^;) e [-^]) = 


—A 


yf^of+of) 


exp 


V 




^L + ^R 


dx. 


This integral tends to 0 for a large shift Apt away from 0, smaller sums of noise variances, or 
larger scale j. Note that this distribution could also be used directly if a specific mean difference 
is to be resolved with a given probability at a given scale. In practice, however, we found that 
such measures were unnecessary for the data we analyzed. 


4.3.2 Dynamic Haar compression 

The homoscedastic case does not typically occur in practice. Even on experimental platforms such 
as aCGH, non-diploid states tend to have higher variance. As a result, a direct estimate of emission 
variances like the one above is not available. Instead, assume 9 was known. As lower thresholds 
remove less wavelet coefficients, thereby retaining a higher number of discontinuities, which 
correspond to potential state transitions, using the minimum variance cr^ in e 6 is a sensible 
way to create compressed data via wavelet shrinkage. Parts of the data generated by states 
whose emission variance equals cr^. n will be compressed as before due to polynomial suppression. 
Higher-variance regions will contain wavelets higher than expected for T observations of minimum 
variance, thus retaining additional wavelets and discontinuities. As the set of breakpoints for 
higher thresholds is a subset of those for lower ones, these regions will be compressed like regions 
with noise variance <T^ in , with additional, superfluous block boundaries thrown in. In other 
words, high-variance regions will be under-compressed, but no block boundary will be missed 
which wouldn’t be missed in the uniform variance case as well. 

Unfortunately, 0 is typically latent and has to be inferred. FBG converges to the correct 6 
over time, providing increasingly accurate a priori samples at each iteration. Therefore, samples 
of 9 can be used to derive approximate noise levels. We hence propose the following simple 
approach: In each FBG sampling iteration, we use the smallest sampled variance parameter cr^ in 
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Algorithm 1 Dynamically adaptive FBG for HMMs 


1 

2 

3 

4 

5 

6 

7 

8 
9 

10 

11 

12 

13 

14 


procedure HAMMLETfy, t a , t q , 

T <— | y | > get data size 

A <— v 7 2 In T > constant factor in universal threshold 

A r^i vu\ t^) > prior sampling of transition probabilities 

0~ V(6 |Tg) > prior sampling of emission parameters 

n ~ 'P(jt | t„) > prior sampling of state distribution 

for i«— 1,... ,1V do > iterate Gibbs sampler 

crmi n <— mino-, {cj mad , cr; | cr^ G 0} > minimum emission variance 

Create block sequence B from threshold Acr min 
q ~ V{q | A, B, 6, n ) using Forward-Backward sampling 
Add count of marginal states for q to result 
A~V(a\t* a ) =V(A\n,q,r A ) oc V(n,q\A)V(A\T A ) 

0~p(0|T* e ) = P(0|q,B,T e )°c V{q,B\e)V{e\T 6 ) 
n ~'P(n\'r* n ) = 'P(n\A,q,z n ') oc V{A,q \n)V{n \t„) 


to create a new block sequence via wavelet thresholding (Algorithm 1). The method is illustrated 
in Fig. 4.2. 

A potential problem could arise if all sampled variances are too large. In this case, blocks 
would be under-segmented, yield wrong posterior variances and hide possible state transitions. 
As a safeguard against over-compression, we use cr^ AD as an estimate of the variance in the 
dominant component, and modify the threshold definition to A • min {ctmad> cr; e 9}. If the data 
is not i.i.d., cr A]AD will systematically underestimate the true variance (Wang & Wang 2007). In 
this case, the blocks get smaller than necessary, thus decreasing the compression. 

4.3.3 The wavelet tree data structure 

The necessity to recreate a new block sequence in each iteration based on the most recent estimate 
of the smallest variance parameter creates the challenge of doing so with little computational 
overhead, specifically without repeatedly computing the inverse wavelet transform or considering 
all T elements in other ways. We achieve this by creating a simple tree-based data structure. 

The pyramid algorithm yields d sorted according to (;, k). Again, let h := Id T, and j := h — j. 
We can map the wavelet xpj k to a perfect binary tree of height h such that all wavelets for scale j 
are nodes on level j, nodes within each level are sorted according to k, and j is increasing from 
the leaves to the root (Fig. 4.3). The vector d represents a breadth-first search (BFS) traversal of 
that tree, with dj ^ being the entry at position [2 ] J + k. Adding y[i] as the i-th leaf on level j = 0, 
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(a) Standard FB Gibbs 


- > 

Genome position 


Per-position latent variables 


(b) Wavelet Compression 



• • • < 


Sufficient 

statistics 


Per-block latent variables 



Compute wavelet compression 
for minimum variance in 9 


Small minimum variance 


Genome position 


Large minimum variance 


Genome position 


Figure 4.2: Overview of HaMMLET. Instead of individual computations per observation (panel a), 
Forward-Backward Gibbs Sampling is performed on a compressed version of the data, using sufficient 
statistics for block-wise computations (panel b) to accelerate inference in Bayesian Hidden Markov 
Models. During the sampling (panel c) parameters and copy number sequences are sampled iteratively. 
During each iteration, the sampled variances determine which coefficients of the data’s Haar wavelet 
transform are dynamically set to zero. This controls potential break points at finer or coarser resolution 
or, equivalently, defines blocks of variable number and size (panel c, bottom). 


each non-leaf node represents a wavelet which is non-zero for the n := 2 ; data points y[t], for 
t in the interval Ij k := [ kn , (k + l)n — 1] stored in the leaves below; notice that for the leaves, 
kn = t. 

This implies that the leaves in any subtree all have the same value after wavelet thresholding 
if all the wavelets in this subtree are set to zero. We can hence avoid computing the inverse 
wavelet transform to create blocks. Instead, each node stores the maximum absolute wavelet 
coefficient in the entire subtree, as well as the sufficient statistics required for calculating the 
likelihood function. More formally, a node t corresponds to wavelet with j = h — j and 
t = k2 ; [ip -10 is simply constant on the [0,1) interval and has no effect on block creation, thus 
we discard it). Essentially, j numbers the levels beginning at the leaves, and t marks the start 
position of the block when pruning the subtree rooted at Nj t . The members stored in each node 


are: 
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0 5 10 15 

Starting position of block (t) 


Figure 4.3: Mapping of wavelets i p jk and data points y t to tree nodes N jt . Each node is the root of a 
subtree with n = 2 J leaves; pruning that subtree yields a block of size n, starting at position t. For 
instance, the node 1V 16 is located at position 13 of the DFS array (solid line), and corresponds to the 
wavelet ip 3 _ 3 . A block of size n = 2 can be created by pruning the subtree, which amounts to advancing 
by 2n — 1 = 3 positions (dashed line), yielding JV 3 8 at position 16, which is the wavelet ipi.i- Thus the 
number of steps for creating blocks per iteration is at most the number of nodes in the tree, and thus 
strictly smaller than 2 T. 


• The number of leaves, corresponding to the block size: 

Nj, t [n] := 2> 


• The sum of data points stored in the subtree leaves: 

N jit [^] := X ^[1] 


N jlt [ Z 2 ] : = Ya y[i]2 


Similarly, the sum of squares: 


Level (z?) 
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• The maximum absolute wavelet coefficient of the subtree, including the current djj. itself: 


N 0 , t [d] 


0 N/> 0>t [d] :=max| d h _ r 2f/t , | 


j<i 

t<t'<t+2’ 


All these values can be computed recursively from the child nodes in linear time. As some real 
data sets contain salt-and-pepper noise, which manifests as isolated large coefficients on the 
lowest level, its is possible to ignore the first level in the maximum computation so that no 
information to create a single-element block for outliers is passed up the tree. We refer to this 
technique as noise control. Notice that this does not imply that blocks are only created at even t, 
since true transitions manifest in coefficients on multiple levels. 

The block creation algorithm is simple: upon construction, the tree is converted to depth-first 
search (DFS) order, which simply amounts to sorting the BFS array according to ( kn,j ), and 
can be performed using linear-time algorithms such as radix sort; internally, we implemented a 
different linear-time implementation mimicking tree traversal using a stack. Given a threshold, 
the tree is then traversed in DFS order by iterating linearly over the array (Fig. 4.3, solid lines). 
Once the maximum coefficient stored in a node is less or equal to the threshold, a block of size 
n is created, and the entire subtree is skipped (dashed lines). As the tree is perfect binary and 
complete, the next array position in DFS traversal after pruning the subtree rooted at the node at 
index i is simply obtained as i + 2n — 1 , so no expensive pointer structure needs to be maintained, 
leaving the tree data structure a simple flat array. An example of dynamic block creation is given 
in Fig. 4.4. 


Proposition 4.3.1. Awavelet tree produces apartition of the data into K blocks in 0(2iC—1) = 0(K) 
time. Thus, each block is created in expected constant time. 


Proof. This follows trivially from the fact that DFS with pruning of a perfect binary tree is 
equivalent to a DFS traversal on a full binary tree, where each leaf corresponds to a block. By a 
standard induction argument, a full binary tree with K leaves has 2 K — 1 nodes, each of which is 
expanded exactly once during block creation. □ 

Once the Gibbs sampler converges to a set of variances, the block structure is less likely to 
change. To avoid recreating the same block structure over and over again, we employ a technique 
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Position along chromosome 


Figure 4.4: Example of dynamic block creation. The data is of size T = 256, so the wavelet tree 
contains 512 nodes. Here, only 37 entries had to be checked against the threshold (dark line), 19 
of which (round markers) yielded a block (vertical lines on the bottom). Sampling is hence done on 
a short array of 19 blocks instead of 256 individual values, thus the compression ratio is 13.5. The 
horizontal lines in the bottom subplot are the block means derived from the sufficient statistics in 
the nodes. Notice how the algorithm creates small blocks around the breakpoints, e. g. at t ss 125, 
which requires traversing to lower levels and thus induces some additional blocks in other parts of the 
tree (left subtree), since all block sizes are powers of 2. This somewhat reduces the compression ratio, 
which is unproblematic as it increases the degrees of freedom in the sampler. 


called block structure memoization. Given a block structure B± created by threshold any A 1; a 
second, lower threshold A 2 < A, creates either the same block structure B 1 , or a block structure 
B 2 in which some blocks of B ] are subdivided into smaller blocks, increasing the number of 
total blocks. Therefore, we can partition the set of possible values for A into intervals [A;, A ; -], 
each of which is associated with a particular, unique number of blocks. Thus, for each block 
sequence length we register the minimum and maximum variance that creates that sequence. 
Upon entering a new iteration, we check if the current variance would create the same number 
of blocks as in the previous iteration, which guarantees that we would obtain the same block 
sequence, and hence can avoid recomputation. 


The wavelet tree data structure can be readily extended to multivariate data of dimensionality 
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Position along Chromosome 


Figure 4.5: An example of a multivariate wavelet tree. The top figure shows the topology of a wavelet 
tree for each of the two data dimensions below. The red and blue paths denote the pruning paths for 
the first and second data dimension, respectively. The bottom two subplots show the block boundaries 
for each dimension. The joint block boundaries for this bivariate data would consist of the union of 
those boundaries. Since they are incurred by entries above the threshold in their respective wavelet 
tree, the trees can be merged into one by taking the maximum absolute detail coefficient across 
dimensions to create the union of block boundaries. Hence, the runtime for finding block boundaries 
does not depend on the dimensionality of the data. 


m (Fig. 4.5). Instead of storing m different trees and reconciling m different block patterns in 
each iteration, one simply stores m different values for each sufficient statistic in a tree node. 
Since we have to traverse into the combined tree if the coefficient of any of the m trees was 
below the threshold, we simply store the largest Nj t [d] among the corresponding nodes of 
the trees, which means that the block creation can be done in O(T) instead of O(mT), i. e. the 
dimensionality of the data only enters into the creation of the data structure, but not the query 
during sampling iterations. 
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4.3.4 Automatic priors 


While Bayesian methods allow for inductive bias such as the expected location of means, it 
is desirable to be able to use our method even when little domain knowledge exists, or large 
variation is expected, such as the lab and batch effects commonly observed in micro-arrays (Luo 
et al. 2010), as well as unknown means due to sample contamination. Since FBG does require a 
prior even in that case, we propose the following method to specify hyperparameters of a weak 
prior automatically. Posterior samples of means and variances are drawn from a Normal-Inverse 
Gamma distribution (/i, cr 2 ) ~ Nir(/i 0 , v, a, /)), whose marginals simply separate into a Normal 
and an Inverse Gamma distribution 


cr 2 ~ ir(a,/3), /i ~ IV /i 0 ,— 

v 


cr 


Let s 2 be a user-defined variance (or automatically inferred, e. g. from the largest of the finest 
detail coefficients, or ct 2 a|) ), and p the desired probability to sample a variance not larger than 


s 2 . From the CDF of If we obtain 


p := P(cr 2 < s 2 ) = 


2r - 


r (a) 


=qk4i- 


ir has a mean for a > 1, and closed-form solutions for a e N. Furthermore, ir has positive 
skewness for a > 3. We thus let a = 2, which yields 


/ 3=- 5 2 ^W_ 1 (-^) + l), 0 < p < 1, 


where W_ ] is the negative branch of the Lambert W-function, which is transcendental. However, 
an excellent analytical approximation with a maximum error of 0.025% is given in Barry et al. 
(2000), which yields 


P> 


2Vb 


M 1 V~b + \/2 (M 2 b exp (M 3 -/b) + l) 


+ b 


b := —lnp, 

M 1 := 0.3361, M 2 := -0.0042, M 3 := -0.0201. 


Since the mean of ir is ^-y, the expected variance of p is ^ for a = 2. To ensure proper mixing, 
we could simply set j- to the sample variance of the data, which can be estimated from the 



62 


sufficient statistics in the root of the wavelet tree (the first entry in the array), provided that /jl 
contained all states in almost equal number. However, due to possible class imbalance, means for 
short segments far away from /r 0 can have low sampling probability, as they do not contribute 
much to the sample variance of the data. We thus define 5 to be the sample variance of block 
means in the compression obtained by ct 2 |A|) , and take the maximum of those two variances. We 
thus obtain 


Mo : = 


Si 

J 

n 


and v = (3 max 



4.3.5 Numerical issues 


To assure numerical stability when working with probabilities, many HMM implementations 
resort to log-space computations, which involves a considerable number of expensive function 
calls (exp, log, pow); for instance, on Intel’s Nehalem architecture, log (FYL2X) requires 55 
operations as opposed to 1 for adding and multiplying floating point numbers (FADD, FMUL) 
(Fog 2016). Our implementation, which differs from (Mahmud & Schliep 2011) greatly reduces 
the number of such calls by utilizing the block structure: The term accounting for emissions and 


self-transitions within the block can be written as 

1 

exp 


A 


jj 


{2n ) n J 2 cr" w 


_^(y[w][/c]-M;') 2 ' 


k =1 


2cr 2 

j 


Any constant cancels out during normalization. Furthermore, exponentiation of potentially small 
numbers causes underflows. We hence move those terms into the exponent, utilizing the much 
stabler logarithm function. 


exp 


■S 


2a 2 


+ (n w - 1) log Ajj - n w log a 


V k=i — j 

Using the block’s sufficient statistics 

n w 


n w , Ei^yMW, S 2 :='^y[w][k] 2 


k =1 


k =1 


the exponent can be rewritten as 


E w (j) := 


2/T.-E 




2cH 


+ K(n w ,j ), 


K(n w ,j ) :=(n w -l)logA --n w logcr + 


MJ 

2cr 2 
j ■ 
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K(n w , j) can be precomputed for each iteration, thus greatly reducing the number of expensive 
function calls. Notice that the expressions above correspond to the canonical exponential family 
form exp((t(x), 6) — F(6) + k(x)) of a product of Gaussian distributions. Hence, equivalent 
terms can easily be derived for non-Gaussian emissions, implying that the same optimizations 
can be used in the general case of exponential family distributions: Only the dot product of the 
sufficient statistics t(x) and the parameters 0 has to be computed in each iteration and for each 
block, while the log-normalizer F{0) can be precomputed for each iteration, and the carrier 
measure k(x) (which is 0 for Gaussian emissions) only has to be computed once. 

To avoid overflow of the exponential function, we subtract the largest such exponents among 
all states, hence E w (j ) < 0. This is equivalent to dividing the forward variables by 

exp f max E w w), 


which cancels out during normalization. Hence we obtain 


a, 


, 0 ') := 


exp ( E w (j) - maxE w (k )) ^ a-w-iWAj, 

J i i 


which are then normalized to 


a w (j) = 


a w (j) 


4.4 Evaluation 

4.4.1 Simulated aCGH data 

A previous survey (Lai et al. 2005) of eleven CNV calling methods for aCGH has established 
that segmentation-focused methods such as DNAcopy (Olshen & Venkatraman 2002; Olshen, 
Venkatraman, et al. 2004), an implementation of circular binary segmentation (CBS), as well 
as CGHseg (Picard et al. 2005) perform consistently well. DNAcopy performs a number of 
t-tests to detect break-point candidates. The result is typically over-segmented and requires a 
merging step in post-processing, especially to reduce the number of segment means. To this 
end MergeLevels was introduced by Willenbrock & Fridlyand (2005). They compare the 
combination DNAcopy+MergeLevels to their own HMM implementation (Fridlyand et al. 2004) 
as well as GLAD (Hupe et al. 2004), showing its superior performance over both methods. 
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Average compression ratio 


Figure 4.6: HaMMLET’s speedup as a function of the average compression during sampling. As 
expected, higher compression leads to greater speedup. The non-linear characteristic is due to the fact 
that some overhead is incurred by the dynamic compression, as well as parts of the implementation 
that do not depend on the compression, such as tallying marginal counts. 


This established DNAcopy+MergeLevels as the de facto standard in CNV detection, despite the 
comparatively long running time. 

The paper also includes aCGH simulations deemed to be reasonably realistic by the community. 
DNACopy was used to segment 145 unpublished samples of breast cancer data, and subsequently 
labeled as copy numbers 0 to 5 by sorting them into bins with boundaries 

(—oo,—0.4, —0.2,0.2,0.4,0.6, oo), 

based on the sample mean in each segment (the last bin appears to not be used). Empirical length 
distributions were derived, from which the sizes of CN aberrations are drawn. The data itself 
is modeled to include Gaussian noise, which has been established as sufficient for aCGH data 
(Hodgson et al. 2001). Means were generated such as to mimic random tumor cell proportions, 
and random variances were chosen to simulate experimenter bias often observed in real data; this 
emphasizes the importance of having automatic priors available when using Bayesian methods, as 
the means and variances might be unknown a priori. The data comprises three sets of simulations: 
“breakpoint detection and merging” (BD&M), “spatial resolution study” (SRS), and “testing” (T) 
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(see their paper for details). We used the MergeLevels implementation as provided on their 
website. It should be noted that the superiority of DNAcopy+MergeLevels was established using 
a simulation based upon segmentation results of DNAcopy itself. 

We used the Bioconductor package DNAcopy (version 1.24.0), and followed the procedure 
suggested therein, including outlier smoothing. This version uses the linear-time variety of CBS 
(Venkatraman & Olshen 2007); note that other authors such as Tsourakakis et al. (2011) 
compare against a quadratic-time version of CBS (Olshen, Venkatraman, et al. 2004), which 
is significantly slower. For HaMMLET, we use a 5-state model with automatic hyperparameters 
P(cr 2 < 0.01) = 0.9 (see Section 4.3.4), and all Dirichlet hyperparameters set to 1. 

Following Mahmud & Schliep (2011), we report F-measures (Fi scores) for binary classi¬ 
fication into normal and aberrant segments (Fig. 4.10), using the usual definition of F = 

TD Tn 

being the harmonic mean of precision n = Tp+Fp and recall p = Tp+FN , where TP, FP, TN and FN 
denote true/false positives/negatives, respectively. On datasets T and BD&M, both methods have 
similar medians, but HaMMLET has a much better interquartile range (IQR) and range, about 
half of CBS’s. On the spatial resolution data set (SRS), HaMMLET performs much better on very 
small aberrations. This might seem somewhat surprising, as short segments could easily get lost 
under compression. However, Lai et al. (2005) have noted that smoothing-based methods such 
as quantile smoothing (quantreg) (Eilers & de Menezes 2005), lowess (Cleveland 1979), 
and wavelet smoothing (Hsu et al. 2005) perform particularly well in the presence of high noise 
and small CN aberrations, suggesting that “an optimal combination of the smoothing step and 
the segmentation step may result in improved performance”. Our wavelet-based compression 
inherits those properties. For CNVs of sizes between 5 and 10, CBS and HaMMLET have sim¬ 
ilar ranges, with CBS being skewed towards better values; CBS has a slightly higher median 
for 10-20, with IQR and range being about the same. However, while HaMMLET’s F-measure 
consistently approaches 1 for larger aberrations, CBS does not appear to significantly improve 
after size 10. The plots for all individual samples can be found in Web Supplement S1-S3, which 
can be viewed online at http://schlieplab.org/Supplements/HaMMLET/, or downloaded from 
https://zenodo.org/record/46263 (DOI: 10.5281/zenodo.46263). 
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Figure 4.7: F-measures of CBS (light) and HaMMLET (dark) for calling aberrant copy numbers on 
simulated aCGH data (Willenbrock & Fridlyand 2005). Boxes represent the interquartile range 
(IQR = Q3—Ql), with a horizontal line showing the median (Q2), whiskers representing the range 
(|lQR beyond Ql and Q3), and the bullet representing the mean. HaMMLET has the same or better 
F-measures in most cases, and on the SRS simulation converges to 1 for larger segments, whereas CBS 
plateaus for aberrations greater than 10. 


4.4.2 High-density CGH array 

In this section, we demonstrate HaMMLET’s performance on biological data. Due to the lack 
of a gold standard for high-resolution platforms, we assess the CNV calls qualitatively. We use 
raw aCGH data (GEO:GSE23949) (Edgren et al. 2011) of genomic DNA from breast cancer 
cell line BT-474 (invasive ductal carcinoma, GEO:GSM590105), on an Agilent-021529 Human 
CGH Whole Genome Microarray lxlM platform (GEO:GPL8736). We excluded gonosomes, 
mitochondrial and random chromosomes from the data, leaving 966,432 probes in total. 

HaMMLET allows for using automatic emission priors (see Section 4.3.4) by specifying a noise 
variance, and a probability to sample a variance not exceeding this value. We compare HaMMLET’s 
performance against CBS, using a 20-state model with automatic priors, P (cr 2 < O.l) = 0.8, 10 
prior self-transitions and 1 for all other hyperparameters. CBS took over 2 h 9 min to process the 
entire array, whereas HaMMLET took 27.1 s for 100 iterations, a speedup of 288. The compression 
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ratio (see Section 4.4.3) was 220.3. CBS yielded a massive over-segmentation into 1,548 different 
copy number levels; cf. Supplement S4 at https://zenodo.org/record/46263. As the data is 
derived from a relatively homogeneous cell line as opposed to a biopsy, we do not expect the 
presence of subclonal populations to be a contributing factor (Burdall et al. 2003; Holliday & 
Speirs 2011). Instead, measurements on aCGH are known to be spatially correlated, resulting 
in a wave pattern which has to be removed in a preprocessing step; notice that the internal 
compression mechanism of HaMMLET is derived from a spatially adaptive regression method, so 
smoothing is inherent to our method. CBS performs such a smoothing, yet an unrealistically large 
number of different levels remains, likely due to residuals of said wave pattern. Furthermore, 
repeated runs of CBS yielded different numbers of levels, suggesting that indeed the merging 
was incomplete. This can cause considerable problems downstream, as many methods operate 
on labeled data. A common approach is to consider a small number of classes, typically 3 to 4, 
and associate them semantically with CN labels like loss, neutral, gain, and amplification, e.g. 
van Wieringen, van ve Wiel & Ylstra (2008), Liu et al. (2006), Gonzalez et al. (2009), 
van de Wiel & van Wieringen (2007), Shah, Lam, et al. (2007), Guha, Li & Neuberg (2006), 
Yin & Li (2009), Hodgson et al. (2001), and Hupe et al. (2004). In inference models that contain 
latent categorical state variables, like HMM, such an association is readily achieved by sorting 
classes according to their means. In contrast, methods like CBS typically yield a large, often 
unbounded number of classes, and reducing it is the declared purpose of merging algorithms, 
see Willenbrock & Fridlyand (2005). Consider, for instance, CGHregions (van de Wiel & 
van Wieringen 2007), which uses a 3-label matrix to define regions of shared CNV events across 
multiple samples by requiring a maximum L 1 distance of label signatures between all probes in 
that region. If the domain of class labels was unrestricted and potentially different in size for 
each sample, such a measure would not be meaningful, since the i-th out of n classes cannot 
be readily identified with the i-th out of m classes for n m, hence no two classes can be said 
to represent the same CN label. Similar arguments hold true for clustering based on Hamming 
distance (Liu et al. 2006) or ordinal similarity measures (van Wieringen, van ve Wiel & Ylstra 
2008). Furthermore, even CGHregions’ optimized computation of medoids takes several minutes 
to compute. As the time depends multiplicatively on the number of labels, increasing it by three 
orders of magnitude would increase downstream running times to many hours. 
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For a more comprehensive analysis, we restricted our evaluation to chromosome 20 (21,687 
probes), which we assessed to be the most complicated to infer, as it appears to have the highest 
number of different CN states and breakpoints. CBS yields a 19-state result after 15.78 s (Fig. 4.8, 
top). We have then used a 19-state model with automated priors (P(cr 2 < 0.04) = 0.9), 10 
prior self-transitions, all other Dirichlet parameters set to 1) to reproduce this result. Using 
noise control (see Section 4.3.3), our method took 1.61 s for 600 iterations. The solution we 
obtained is consistent with CBS (Fig. 4.8, middle and bottom). However, only 11 states were 
part of the final solution, i. e. 8 states yielded no significant likelihood above that of other states. 
We observe superfluous states being ignored in our simulations as well. In light of the results 
on the entire array, we suggest that the segmentation by DNAcopy has not sufficiently been 
merged by MergeLevels. Most strikingly, HaMMLET does not show any marginal support for 
a segment called by CBS around probe number 4,500. We have confirmed that this is not due 
to data compression, as the segment is broken up into multiple blocks in each iteration (cf. 
Supplement S5 at https://zenodo.org/record/46263). On the other hand, two much smaller 
segments called by CBS in the 17,000-20,000 range do have marginal support of about 40% in 
HaMMLET, suggesting that the lack of support for the larger segment is correct. It should be 
noted that inference differs between the entire array and chromosome 20 in both methods, since 
long-range effects have higher impact in larger data. 

We also demonstrate another feature of HaMMLET called noise control. While Gaussian 
emissions have been deemed a sufficiently accurate noise model for aCGH (Hodgson et al. 
2001), microarray data is prone to outliers, for example due to damages on the chip. While it is 
possible to model outliers directly (Shah, Xuan, et al. 2006), the characteristics of the wavelet 
transform allow us to largely suppress them during the construction of our data structure. Notice 
that due to noise control most outliers are correctly labeled according to the segment they occur 
in, while the short gain segment close to the beginning is called correctly. 

4.4.3 Effects of wavelet compression on speed and convergence 

The speedup gained by compression depends on how well the data can be compressed. Poor 
compression is expected when the means are not well separated, or short segments have small 
variance, which necessitates the creation of smaller blocks for the rest of the data to expose 
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potential low-variance segments to the sampler. On the other hand, data must not be over¬ 
compressed to avoid merging small aberrations with normal segments, which would decrease 
the F-measure. Due to the dynamic changes to the block structure, we measure the level of 
compression as the average compression ratio, defined as the product of the number of data 
points T and the number of iterations N, divided by the total number of blocks in all iterations. 
As usual a compression ratio of one indicates no compression. 

To evaluate the impact of dynamic wavelet compression on speed and convergence properties 
of an HMM, we created 129,600 different data sets with T = 32,768 many probes. In each 
data set, we randomly distributed 1 to 6 gains of a total length of {100,250,500, 750,1000} 
uniformly among the data, and do the same for losses. Mean combinations 


(Ml oss> Mneutrab Mgain) 

were chosen from (Id Id 1, Id |), (—1,0,1), (—2,0,2), and (—10, 0, 10), and variances 


( (j ^ t 

loss’ u neutral’ gain-' 


where chosen from (0.05,0.05,0.05), (0.5,0.1,0.9), (0.3,0.2,0.1), (0.2,0.1,0.3), (0.1,0.3,0.2), 
and (0.1,0.1,0.1). These values have been selected to yield a wide range of easy and hard cases, 
both well separated, low-variance data with large aberrant segments as well as cases in which 
small aberrations overlap significantly with the tail samples of high-variance neutral segments; 
an example of a hard inference task is shown in Fig. 4.9. Consequently, compression ratios 
range from ~1 to ~2,100. We use automatic priors P(cr 2 < 0.2) = 0.9, self-transition priors 
a it G {10,100,1000}, non-self transition priors a i} = 1, and initial state priors a € {1,10}. Using 
all possible combinations of the above yields 129,600 different simulated data sets, a total of 4.2 
billion values. 

We achieve speedups per iteration of up to 350 compared to an uncompressed HMM (Fig. 4.6). 
In contrast, Mahmud & Schliep (2011) have reported ratios of 10-60, with one instance of 90. 
Notice that the speedup is not linear in the compression ratio. While sampling itself is expected 
to yield linear speedup, the marginal counts still have to be tallied individually for each position, 
and dynamic block creation causes some overhead. The quantization artifacts observed for larger 
speedup are likely due to the limited resolution of the Linux time command (10 ms). Compressed 
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HaMMLET took about 11.2 CPU hours for all 129,600 simulations, whereas the uncompressed 
version took over 3 weeks and 5 days. All running times reported are CPU time measured on a 
single core of a AMD Opteron 6174 Processor, clocked at 2.2 GHz. 

We evaluate the convergence of the F-measure of compressed and uncompressed inference 
for each simulation. Since we are dealing with multi-class classification, we use the micro- and 
macro-averaged F-measures (F mi , F ma ) proposed by Ozgur, Ozgur & Gungor (2005): 


2np 
n + p 


E“rTP, 


with 


n = 


2f=iTP, 


iSrCTPi + FPi)' 


P = 


S^tCTPi + FN 0 
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V“iM j-, 
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- With 71; = -, p; = -, F; = -. 

M TP;+FP; TP;+FN; 7T; + P; 


Here, TP; denotes a true positive call for the i-th out of M states, n and p denote precision and 
recall. These F-measures tend to be dominated by the classifier’s performance on common and 
rare categories, respectively. Since all state labels are sampled from the same prior and hence 
their relative order is random, we used the label permutation which yielded the highest sum of 
micro- and macro-averaged F-measures. Instead of finding this permutation by brute force, which 
requires K\ computations for k states, we solve a combinatorial assignment problem: Let C be the 
confusion matrix of the segmentation, i. e. C tp is the number of true observations of state t which 
have been predicted as state p. Notice that TP; + FP; is just the sum C; of the i-th column of C, 
and likewise TP, + FN; is the sum P, of the i-th row. Hence, F mi can be computed as v "\ . The 
best F mi is thus obtained by the column permutation which maximizes the trace. Likewise, let 


Then 
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2 tt tpPtp C tp 

M tn := —7-- + -, 

K(n tp + p tp ) ^tp 

and the maximum sum is attained by the column permutation which maximizes the trace. This 
is the maximum weighted bipartite matching problem, which can easily be transformed into 
the assignment problem and solved in 0(fC 3 ) time using the Hungarian method due to Konig, 
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Egervary and Munkres (Kuhn 1955). The simulation results are included in Supplement S6 at 
https://zenodo.org/record/46263. 

In Fig. 4.10, we show that the compressed version of the Gibbs sampler converges almost 
instantly, whereas the uncompressed version converges much slower, with about 5% of the cases 
failing to yield an F-measure > 0.6 within 1,000 iterations. Wavelet compression is likely to yield 
reasonably large blocks for the majority class early on, which leads to a strong posterior estimate 
of its parameters and self-transition probabilities. As expected, F ma are generally worse, since 
any misclassification in a rare class has a larger impact. Especially in the uncompressed version, 
we observe that F ma tends to plateau until F mi approaches 1.0. Since any misclassification in the 
majority (neutral) class adds false positives to the minority classes, this effect is expected. It 
implies that correct labeling of the majority class is a necessary condition for correct labeling of 
minority classes, in other words, correct identification of the rare, interesting segments requires 
the sampler to properly converge, which is much harder to achieve without compression. It 
should be noted that running compressed HaMMLET for 1,000 iterations is unnecessary on the 
simulated data, as in all cases it converges between 25 and 50 iterations. Thus, for all practical 
purposes, further speedup by a factor of 40-80 can be achieved by reducing the number of 
iterations, which yields convergence up to 3 orders of magnitude faster than standard FBG. 

4.4.4 Coriell, ATCC and breast carcinoma 

The data provided by Snijders, Nowak, et al. (2001) includes 15 aCGH samples for the Coriell 
cell line. At about 2,000 probes, the data is small compared to modern high-density arrays. 
Nevertheless, these data sets have become a common standard to evaluate CNV calling methods, 
as they contain few and simple aberrations. The data also contains 6 ATCC cell lines as well as 
4 breast carcinoma, all of which are substantially more complicated, and typically not used in 
software evaluations. In Fig. 4.11, we demonstrate our ability to infer the correct segments on 
the most complex example, a T47D breast ductal carcinoma sample of a 54 year old female. We 
used 6-state automatic priors with P (cr 2 < O.l) = 0.85, and all Dirichlet hyperparameters set to 
1. On a standard laptop, HaMMLET took 0.09 seconds for 1,000 iterations; running times for the 
other samples were similar. Our results for all 25 data sets have been included in Supplement S7 
at https://zenodo.org/record/46263. 
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4.5 Forward bias and convergence behavior 

We have shown that, compared to uncompressed FBG, state sequences sampled from HaMMLET 
very quickly approach the ground truth. If initial iterations are treated as burn-in, tallying the 
maximum marginals approaches the MPM segmentation. Since this behavior is measured in the 
number of iterations, not CPU time, the qualitative differences must be due to features of wavelet 
compression other than the mere reduction in data size. 

Compressing the data into blocks assumes that all emissions within the block were generated 
by the same latent HMM state. However, this ignores the contribution of state paths that switch 
states within the block, and raise the question of bias in the sampler. A typical assumption is 
that the overall contribution of such paths is small, since non-generating states will yield low 
emission likelihoods for the observed data, and off-diagonal probabilities in A are often small 
in practice. Mahmud & Schliep (2011) showed that this weak path assumption holds for a 
homoscedastic Gaussian HMM with well-separated means if all emissions within a block are 
closer to the mean of their generating state than to those of any other state. The proof does not 
distinguish between the parameters of the generating HMM and the ones used in the forward 
recursion, meaning that, for the proof to hold, we have to assume strong emission posteriors 
right from the beginning. Of course, a Gibbs sampler might not necessarily yield some ij x 
for which these assumptions hold, especially during the burn-in phase. We therefore derive 
the multiplicative error of normalized forward variables under arbitrary compression and for 
general—not necessarily EFD, homoscedastic or univariate—HMM: 

Definition 4.5.1 (Forward bias). Let H be a K-state Hidden Markov Model states, such that all 
emission likelihoods are positive. Let a t [s]be the normalized forward variable of state s at position t. 
Let a t [s]be the normalized approximate forward variable calculated from a t by ignoring transitions 
between different states. Let 0 < i,j,s < K, and 

a x := a := a t _ x 

ix ■='P<iy[t]\q[t] = x) i :=P(y[t]|q[t]) 

Xij CLiAijtj, 
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so that 


a t [s] = 


Si x is 

2W 


a t [s] = =^~. 

2lj x Jj 

Then we define the forward bias of state s at position t as 

« t [s] X ss 2iJ X U 


Ct ■= 


® t l>] Hn x . 




Proposition 4.5.1 (Upper bound on forward bias). The forward bias is bounded as 

2^ j x ij 


c t£ |^ = 1 + 




'} 


In particular, if the transition probabilities are bounded as 


Vj : Ajj > 5, 


Vi # j : Atj < e, 


the forward error is bounded as 


C,<l + I 



Proof 


Furthermore, 


X ss'YiiJ x ij _ X ss (Xi; X jj + x ij) 

Eij x jj x is S(i=s) x ././ x .« + x jj x is 

J J 

x ss (S; x jj + x ij) _ 2i,j x 0' 

x ss x jj 2/ x jj 


T.j x jj | 

2j x jj 


= i + 


2/ X jj 


C t < 1 + 


= 1 + 


2 j x jj 

Si/j Wj 


< 1 + 


c Z#j g jl; 

5 Xj a A 


(4.1) 

(4.2) 

(4.3) 

(4.4) 

(4.5) 


(4.6) 
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Z,< 




Z,-« 


,j “•j'-j 


(4.7) 

(4.8) 

(4.9) 

(4.10) 

□ 


Obviously, the bias is small for HMM with strong self-transitions, as well as a s and ( s approach¬ 
ing 1. For Gaussian HMM, these criteria are typically met for emissions with low variance and 
well-separated means whenever the sampled G 1 '' 1 approaches the true emitting G, confirming the 
result of Mahmud & Schliep (2011) in those special cases. In general, due to the rearrangement 
inequality, the bound is tightest whenever the entries of a and £ are in the same order. 

We conjecture that the forward bias is the reason for the observed fast convergence towards 
MPM segmentation. Consider the case of uniform self-transition and transition probabilities: 

Proposition 4.5.2 (Bias towards largest forward variable). Let 


s = argmaxcq 
i 

be the state with the largest forward variable at t — 1. Let Vi : A it = 5 and Vi^j : A = e. Then 


C t > 1. 


Proof. The forward bias of state s is 


_ x ssZiJ x u _ x ss Zij x ij _ *ss (Z; + Zi£/ x ij) 

'LiJ x jj x is (z, x jj ) (Si *is) (Zj x )j) (*« + Zi* *is) 

X ss Z; X jj "b X ss Zfjtj X ij Z j x jj + Zi^; X ij 

= S, Wi + (Si x„) (£* 73 = 2, + Y., XJ S L 

ITS 

Z j X jj "b Zi/; a i£lj Zj x jj "b e Zi/j a i^j 


f i X Jj 


+ Z j 

i H 


ctjdCjCiiets 
a,51, 




U + e Z j CLi^ii 

l£s 
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5 Hj a /j + e 


5 Hj a jtj + e 


a s^j 2 iti a i^i 
its 


a, tj + Yiit) a i^j a 


its 


a • dr cl • 

The claim follows since V/ : a„ > a,-, so — < 1 and a=~r>—. □ 

J J a s * a s a s 

In fact, a sufficient criterion can be established for the direction of the forward bias which 
does not involve the emission likelihoods. 


Proposition 4.5.3 (Direction of forward bias). Let the number of states K > 3. Then for any 
relation ~ e {< , = , >}, the forward bias of state s 


C t := 


«tl>] _ 1 
a t [s] 


if and only if 


E h (“.A, (m,7 + a i A ii (oiA. +1^)) - 0. 

its 

its 

it) 

In particular, the relation holds if 


a s A ss Aij ~ ajAjjAi s 

Vj # s : a 5 AsAj - tfAjjAjs- 

Proof We rewrite C t ~ 1 as a comparison of numerator and denominator: 

Y^^sXij-Y^XjjXis 


i,j 




5 ] + 2 XssX ij x + 2 X b' Xis 


2 X « X U ^5] x h' Xis 

i i 

its jts 

2 x ss x jj + 2 :x 2 X JJ X « + 2 x w x is 


its its 

iti 


its ]ts 

its 


2 X « X U 

its 
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x ii Xis 

its 

its 


(4.11) 

(4.12) 


(4.13) 

(4.14) 

(4.15) 

(4.16) 

(4.17) 



76 


5] x ^ x ij + 2 XssXs i ~ 2 x jj x “ + 2 x ii x i* W.l 8 ) 

i/s j/s its j/s 

j/s 

m ¥j 

On each side, there are (K — l) 2 — (K — 1) = K 2 — 3K + 2 terms in the left and K — 1 in the right 
sum. In order to match the index set, we replace each summand on the right by K — 2 normalized 
copies, one for each j / i s on the left, yielding (fC — l)(fC — 2) = K 2 — 3K + 2 summands: 


i v - 1 x ss x sj sr~' V -1 x jj x js 

2_t x ss x ij + 2_i ~ x jj Xis + 2-i K ^2 

ijts its K A its its 

its its its jts 

¥) ¥) ¥) ¥i 

We then get 

2 ( x * ( x 6 + _ x n ( x - + ~ 0 

its 

j¥=s 

¥i 

Dividing by £ s yields 

V,, (a, a, Ua,+^ 1 ) -°Ai (m* + )) - (o 

its 

j^s 

¥j 

In particular, the proposition holds if 


s 

X ss x ij ~ J] X Jj X is 

(4.19) 

its 

its 


j^S 

¥s 


¥J 

¥i 


S 

x ss x sj "" x jj x js > 

(4.20) 

¥s 



which in turn holds especially if x holds for all matching summands, 


Vi # s,j 

S,i^j- x ss x ij - x jj x is 

(4.21) 

Vj # s : 

Y V • X Y - - Y- 
^SS^SJ JJ JS’ 

(4.22) 

hence 

£s,i£ } : 

d'S'Ass^'s^i^ij^j dj'Ajj(,jGLi<Ai s ('s 

(4.23) 

Vj # s : 


(4.24) 


and by canceling terms, 


Mi^s,j^s,i^j: a s A ss Aij x a ; - AjjA u 


(4.25) 
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Vj : a*A ss A sj ~ ajAjjA js . (4.26) 

□ 

Let s be the state with the largest forward variable at t, i. e. Vj : a t [s] > a t [j]. As shown 
before, for uniform 5 and e, those relations hold. However, the result also suggests a general 
tendency for a t [s] > a t [s] if self-transitions and other transitions are each on the same orders. 
Since ^ ^ and ^ ^ imply C t > 1, approximate equality „4 SS r* Ajj & 5 and \fi ^ j : 

Aij R* e imply that the forward bias will be C t > 1. 

We demonstrate this effect in Fig. 4.12: For 4 states, we simulated a and £ as random 
variates from a Dirichlet distribution Dir(10,1,1,1), and the rows of A as a Dirichlet variate with 
parameter 100 for the diagonal entries and t € {1,25,50,100} for the off-diagonal entries. We 
plot the elements of a against those of a, where the data point is shown in red for the maximal 
entry a s and blue for the others. Clearly, there is a tendency for a s to yield a t [s] > a t [s]. As 
expected, the effect is less pronounced for lower t, since those A have higher self-transition 
probabilities. 

Since we have shown before that wavelet compression creates a block structure which is 
consistent with the locations of state transition in MPM segmentation, s is likely to indicate 
the maximum posterior margin state for the entire block starting at t. Having a forward bias 
greater than 1 means that s will be sampled with higher probability than under unbiased forward 
filtering. Also, due to the recursive nature of forward-variable computation, the error is expected 
to grow within a block, with the other forward variables approaching 0; this effect is illustrated 
in Fig. 4.13. Therefore, a wavelet compressed block is sampled as the MPM state under the 
parametrization of the current iteration, with a probability approaching 1 for large blocks. 
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Figure 4.8: Copy number inference for chromosome 20 in invasive ductal carcinoma (21,687 probes). 
CBS creates a 19-state solution (top), however, a compressed 19-state HMM only supports an 11-state 
solution (bottom), suggesting insufficient level merging in CBS. 
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Figure 4.9: An example of a hard inference task from the simulations. Notice that the minority 
components have been correctly identified (green, dark blue), despite being very short and their means 
being contained well within the tails of the dominant emission distribution (light blue). 
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Figure 4.10: F-measures for simulation results. The median value (black) and quantile ranges (in 5% 
steps) of the micro- (top) and macro-averaged (bottom) F-measures (F mi , F ma ) for uncompressed (left) 
and compressed (right) FBG inference, on the same 129,600 simulated data sets, using automatic 
priors. The x-axis represents the number of iterations alone, and does not reflect the additional speedup 
obtained through compression. Notice that the compressed HMM converges no later than 50 iterations 
(inset figures, right). 
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Figure 4.11: HaMMLET’s inference of copy-number segments on T47D breast ductal carcinoma. 
Notice that the data is much more complex than the simple structure of a diploid majority class with 
some small aberrations typically observed for Coriell data. 
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(a) t = 100 


(b) t = 25 



Forward variable Forward variable 


(c) T = 10 (d) T = 1 

Figure 4.12: Demonstration of the tendency of compressed forward variables to be biased towards 
the state with the largest marginal probability (red), and away from the others (blue). Variables on 
the diagonal are unbiased. Subplots are shown for different Dirichlet distributions Dir(100, t, t, t). 
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Figure 4.13: A demonstration of the increasing forward bias due to recursive computation of forward 
variables, for different Dirichlet weights t. Regardless of the actual size of forward variables (upper 
subplots in each subfigure), the compressed forward variables quickly converge to 1 for the maximum 
state (blue). 
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Chapter 5 

Algorithm engineering for big data 
applications 


Having demonstrated the properties and theoretical background of HaMMLET in the previous 
chapters, we now address the issue of implementation beyond the prototype stage. While 
convergence and speed have been addressed, memory consumption remains problematic for 
genome-sized data. In this chapter, we describe our design decisions to scale HaMMLET to 
big-data applications. The main contributions in this chapter are as follows: 

1. We show that the wavelet tree data structure yields under-compression, resulting in trellises 
which are too large. 

2. The wavelet tree is also memory-inefficient: For a given T, it stores 2T — 1 sufficient 
statistics and as many maximum coefficients. This is the exact number of distinct statistics 
required for arbitrary thresholds; however, the statistics stored are not the ones that are 
queried. 

3. The wavelet tree was a monolithic data structure in which each node was used to store 
marginal state counts as well as a data point (leaves) or sufficient statistics (inner nodes). 
This implementation proved to be very wasteful, and did not scale to genomic data sizes. 

4. To alleviate those issues, we designed specialized data structures for different functionalities 
previously contained in the wavelet tree: block boundaries are stored in a breakpoint array, 
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statistics for arbitrary blocks of size N are queried in O(lniV) time from a modified integral 
array of size T, and a queue-based implementation of marginal records is used to store and 
update run-length encoded counts for marginal state counts. 

5. We show that for univariate data, the breakpoint array for Haar weights can be computed 
in-place in linear time. 

6. For multivariate data, we show that the query time for each block is independent of 
dimensionality; once the block boundaries have been determined in In N time, the statistics 
at dimension d can be queried in constant time, so iterating through all d dimensions of a 
block can be done in d + In N time. 

7. For multivariate data, a breakpoint array with Haar weights would require two arrays of 
size T to store the results of several Haar transforms in order to compute maxima across 
dimensions. Instead, we modify the lifting scheme for the Haar transform to process the 
input in a different order, allowing the data to be processed as an input stream without 
storing O(T) temporary values. This yields a computation of coefficient maxima across 
dimensions in dT time and T + d Id T space. 

5.1 Wavelet tree revisited 

To motivate the need for a different implementation, let T hg := 3,500,000,000, the approximate 
number of base pairs in the human genome. If wavelet weights and sufficient statistics are stored 
as IEEE-754 32-bit floating point numbers, then for Gaussian HMM we require a total of 5 ■ 8 T hg 
bytes, or more than 130 GB RAM. It is possible to create optimal compression, i. e. one that does 
not introduce more breakpoints than there are discontinuities in the wavelet regression, but that 
would require storing both the absolute as well as the maximum subtree coefficients, adding 
another 26 GB. 

Additionally, our original implementation stored the block sizes with the sufficient statistics, 
as the number of subtree leaves is not easily obtained from the BFS nor DFS pre-order index of a 
node in constant time, especially not if the number of leaves is not a power of 2 and the tree is 
truncated. For general implementations, the safe way to implement this is using an unsigned 
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integer type the size of a pointer on the target platform, and typically incurs at least another 8 
byte twice per input position. Furthermore, using the pyramid algorithm, we require padding 
of the data to a power of two, and an auxiliary array for calculating the wavelet coefficients, 
which in the worst-case scenario almost quadruples the input size. In that case, the memory 
requirement increases to 80T, or 26 GB for T hg . These requirements can be avoided by using a 
modified version of an in-place Haar wavelet transform. 

In addition, we previously used a tree-like layout to record the marginal counts for each block, 
which was wasteful and, in the worst case, could incur another 2 TK unsigned integer variables 
large enough to count the number of iterations. Even for 2-byte integers, a 20-state HMM would 
incur another 131 GB for the human genome. In total, our original implementation of HaMMLET 
would require close to 400 GB RAM for a realistic model. Several implementation details such as 
suboptimal use of STL vectors, each adding 20 byte overhead, pushed the memory load to the 
terabyte range, resulting in excessive swapping, rendering the implementation useless for WGS 
data. 

5.2 Dynamic block creation 

Consider the following abstract data structure: 

Definition 5.2.1 (Block generator). Let b be a vector of breakpoint weights. For a threshold 
A, let Yx be a partition of y into blocks such that there is a block boundary between positions 
t — 1 and t if b[t] > A. We call a data structure a block generator if it can, for any threshold A, 
generate an ordered sequence ofsujficient statistics that represents Y ? . A block generator is called 
compressive if for all A, b[t] < A implies that no breakpoint is created between t — 1 and t. It is 
called subcompressive if for some A such a superfluous block boundary is created. A block generator 
is called space-efficient if it stores no more than T sufficient statistics. 

This definition of a block generator implies that Yx is a subdivision of Y- k . 2 if A, < A 2 . For 
sufficiently small thresholds, we require sufficient statistics for each data point, hence any block 
generator implementation will have to store a minimum of T sufficient statistics. On the other 
hand, if all entries in b are unique, each breakpoint subdivides a block defined by a higher 
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threshold, and a simple induction argument shows that a block generator has to be able to 
generate 2T — 1 different blocks and their sufficient statistics: starting with a single block of 
size T and a sorted sequence of threshold values in b, each threshold creates two new blocks by 
subdividing one block in the previous partition. 

Proposition 5.2.1. The wavelet tree is a subcompressive and memory-inefficient block generator. 

Proof The wavelet tree is memory-inefficient as it stores 2T — 1 sufficient statistics. Wavelet 
compression is based on the fact that if all coefficients in a subtree are below the threshold, 
then so is the maximum coefficient in that subtree. The converse however is not true; if the 
maximum coefficient is above the threshold, then not all coefficients in the subtree have to be 
above threshold. Whenever the DFS traverses into a subtree, three breakpoints are introduced into 
the block structure, corresponding to the three discontinuities in the Haar wavelet represented by 
the subtree’s root. This happens whenever the subtree contains any coefficient above threshold, 
regardless of whether the coefficient at the root itself is above threshold. Fig. 5.1 illustrates a 
simple counterexample to compressivity. □ 

It should be noted that, while the wavelet tree stores as many sufficient statistics as needed 
for T data points, the fact that it is subcompressive implies that the block structures it creates 
differ from those of a compressive block generator, and hence these are not the same 2T — 1 
statistics that would occur in accross all block structures a compressive block generator would 
yield. Compressivity could be restored by keeping copies of the wavelet coefficients overwritten by 
the subtree maxima, which would potentially double the memory requirements for the coefficient 
array, and is thus not preferred for big data applications. 

In order to provide an efficient implementation, we separate a block generator into two 
sub-structures: a breakpoint array to derive a sequence of start and end positions for blocks, and 
an integral array to query the sufficient statistics for each block. 

5.2.1 Integral array for block statistics 

Let a data structure D(y ) support the following query: given a start index s and an end index 
e, with s < e, return the sufficient statistics in the half-open interval [s,e), i. e. T(y[t]). 
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Figure 5.1: The wavelet tree data structure is subcompressive, as it induces additional breakpoints. In 
this example, coefficients with an absolute value above the threshold are denoted as black boxes. In 
the tree layout induces by the lifting scheme, the position of a coefficient equals that of the central 
discontinuity of its associated Haar wavelet. For instance, i/> 20 has positive support on y[0],y[l], and 
negative support on y[2],y[3], with its position 2 being the lowest position of its negative support. 
The positions of the block boundaries are indicated by thin solid vertical lines, connected to their 
respective tree node by vertical lines. In this example, | d 10 1 > A, inducing block boundaries at 0, 1 
and 2, and |d lj7 | > A, inducing block boundaries at 14, 15 and 16, creating 5 blocks in total. By taking 
the maximum subtree coefficient, additional inner nodes contain values above threshold, indicated 
here by gray boxes, Traversing into subtrees below those nodes induces additional block boundaries, 
indicated here by dotted lines, at 2, 4, 8, 12 an 14. This yields a total of 8 blocks. 


A trivial implementation of such a data structure would be to store the statistics of each input 
position, and then iterate through the array and calculate their cumulative sums between 
breakpoints. This is obviously costly for huge data, as it incurs 0(1V) time complexity for a block 
of size N. Constant-time queries could be made by pre-computing all T 2 statistics, which is 
obviously prohibitive for large data. 

The basic idea for querying sufficient statistics comes from a simple data structure in image 
processing called a summed-area table or integral image (Lewis 1995), which is used to query 
the sum of a rectangular region in constant time. As its one-dimensional equivalent, let v be an 
integral array such that 

fT(0) t = 0 

Z= o ny[t]) t> 0. 


v[t] = { 
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Figure 5.2: An illustration of an integral array v, using cell size c = 4. Columns represent data 
positions, and contain all positions i which are added up and stored at v[t]; for instance, v[9] = 
Si= 9 rCy[i]). The statistics of a block [s,e) are obtained by adding v[s], v[m] for all s < m < e, 
m = 0 mod c, and subtracting v[e] iff e ^ 0 mod c. For instance, block [3,10) is obtained as 
v[3] + v[4] + v[8] - v[10], yielding £’=3 T(y[t]). 


For any arbitrary start and end positions s,e, the sufficient statistics of the block [s, e) can be 
calculated in constant time as 


e—1 f s —1 

2 r (y[t])= 


t=s 


<t =0 


HI™'") 


v[e]-v[s]. 


In contrast to image processing, where integral arrays are constructed over integer data, 
sufficient statistics require floating-point values for most distributions. Unfortunately, this incurs 
numeric problems at large data sizes. An IEEE 754 single-precision float has between 6 and 9 
significant digits. Assuming that values for sufficient statistics are on the order of 1, the further 
back a data point is in v, the more of its significant digits is used to store the sum. Neighboring 
entries will be similar or even equal, leading to catastrophic cancellation or even 0 for short 
segments. For instance, values above ~ 17 million are rounded to multiples of 2, so that even if 
each entry was 1.0, blocks of size 1 would be queried as 0. 

To alleviate this, we subdivide v into non-overlapping cells of size c, and compute partial 
cumulative sums of sufficient statistics within each cell; for convenience, we compute these sums 
from high to low indices, see Fig. 5.2. It is then easy to see that 

^T(y[t]) = —v[e], j &{s}u{i\s <i< e,i = 0 (mod c)} 

In our implementation, we used c = 2 16 = 65,536. 


Numerical issues Regardless of the data structure being used, an approach relying of sufficient 
statistics raises a general issue of numerical stability. In particular, for Gaussian emissions, 
updating the Normal-Inverse Gamma hyperparameter 


1 f E? 

( 3 ^ + -(^ 2 -- + 


Nv 


N+v\N 


-a 
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contains a numerically unstable term, 



akin to naive computation of the sample variance. As a safeguard, our implementation uses 

f E 2 

max t 0,E ? -- 

[ 2 N 

instead, in order to avoid non-positive values of /3. 

One advantage of using the wavelet tree was that the sufficient statistics were computed 
recursively bottom-up in a binary tree, an order which corresponds to a technique called pairwise 
summation. This is known to yield a relatively stable sum with at most 0(e \/log T ) roundoff-error, 
where e is machine precision (Higham 1993). 

Changing from a wavelet tree to an integral array trades this stability for a reduction in 
memory from 2T — 1 to T statistics. To counteract this effect, we use Kahan’s summation 
algorithm (Kahan 1965) for cumulative sums within each cell, so that the expected summation 
error remains bounded independent of c. 

5.2.2 Breakpoint array for block boundaries 

In order to create a block generator, the integral array has to be supplemented with a data structure 
which yields start and end positions Sj.(A), e fc (A) for subsequent blocks k. Since e fc (A) = Sj. +1 (A), 
it suffices to implement an iterator over s k for increasing k, where 

Sq 0 s k e k (?C) Sfc+i(A) 

We use a simple array of pointers to facilitate these queries: 

Definition 5.2.2 (Breakpoint array). Let b e R T be a vector of breakpoint weights , and p e l7 + be 
a vector of pointers. A data structure (b,p) is called a breakpoint array of input data y if and only if 

ft<i<t+p[t]:b[t~\>b[i ] (5.1) 

We call each interval [t,... ,p[t] — 1] a stretch at t. A breakpoint array is called maximal if 
for all T there exist no n> p[t] such that setting p[t] to n would still result in a valid breakpoint 


array. 



Breakpoint weights b[t] 
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Figure 5.3: An example of generating blocks following pointers in a breakpoint array. The top figure 
represents the input data y, the bottom figure represents the absolute wavelet coefficients, as well as 
the pointers (grey lines) and the path taken by the query (red). Whenever a value above the threshold 
(horizontal blue line) is found, a breakpoint is returned (vertical blue lines). 



0 1 2 3 4 5 6 7 8 9 10 11 12 13 


Figure 5.4: An example of a breakpoint array. Horizontal black lines represent the jumps indicated by 
the pointers, red lines represent stacks, and gray lines are for visual clarity. Upon processing index t, 
the index stack contains those elements obtained by following the red lines to the top left, starting at 
b[t]. Notice how this creates a sequence of right-to-left maxima. 
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Algorithm 2 Constructor of a maximal breakpoint array for a vector b of breakpoint weights, 
pointer array p and a maximum jump size m. S t is a deque (double-ended queue). 

l 

procedure BREAKPOINTARRAYCONSTRUCTOR(b, V, m) 

2 

pushback(S t , 0) 

> make first position pending 

3 

for t «— 1,..., T — 1 do 


4 

if |S t | > 0 then 


5 

if t — S t [front] = m then 

> check distance of farthest element 

6 

p[S t [front]] <— m 

> set farthest jump pointer 

7 

popfront (S t ) 

> mark as processed 

8 

while |S t | > 0 do 

> go through stack to find pending elements 

9 

if b[S t [back]] < b[t ] then 

t> pending elements with smaller weights 

10 

reduceStack() 

> set pending pointers and statistics 

11 

else 


12 

break 

> rest of stack has larger weights 

13 

push(S t , t) 

> make current position pending 

14 

t <h- T 


15 

while [S t | > 0 do 

> all remaining elements point to one-past-the-end 

16 

reduceStacks() 


17 

18 

function reduceStack() 


19 

i <— S t [back] 

> get closest pending index i 

20 

p[i ] *-t-i 

> set its pointer to the distance to current index 

21 

popback(S t ) 

> remove index from stack 


Proposition 5.2.2. Algorithm 2 constructs a breakpoint array in linear time O(T). 

Proof. A linear-time algorithm to calculate the pointers to the next element at least as large as 
the current one is well established in algorithmic folklore. It is modified here to use the distance 
to that element instead of a direct pointer (line 20, which would normally read p[i] <— t). The 
stack is changed to a deque to accommodate the inclusion of a maximum jump size m. The front 
of the deque is popped and its pointer set whenever it is m positions away, which happens at 
most T times. □ 

For each t, p[t ] points to the beginning of next stretch. Within each stretch, the highest 
breakpoint weight is located at its first position; when searching for weights below a given 
threshold A, once the first weight is found to be below A, all others can be safely ignored, leading 
to a simple query (Algorithm 3, demonstrated in Fig. 5.3). Given a position e fc (A) of block k 
under threshold A, the next position e fc+1 (A) can be found in O(lnlV) expected time, where 
N = ej-(A) — s^(A). In order to derive the query complexity, we require the following result: 
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Algorithm 3 Given a breakpoint weight threshold A, and a breakpoint position s ; this method 
computes the next breakpoint position s i+1 where b[s l+] ] > A It returns true if such a signal 
exists, and false otherwise to indicate that the iterator is finished. 


1 

procedure BreakpointIterator(s 

, A) 

2 

if s ; > T then 

> no more blocks left 

3 

return false 


4 

5 ;+i s i + 1 

> set potential end to next position 

5 

while s i+1 < T do 

> determine the end of the current block 

6 

if b[s ;+1 ] < A then 

> current weight below threshold 

7 

s i+ 1 5 ;+i +p[ 5 ;+i] 

> skip the following lower weights 

8 

else 


9 

break > 

higher weight than at block start determines block end 

10 

N ^s i+1 -Si 

> set current block size 

11 

return true 



Proposition 5.2.3 (Left-to-right maxima (LovAsz 1993; Knuth 1997)). For a vector x, let x[t] 
be called a left-to-right maximum of x iff Vi < t : x[i] < x[t]. Let m x count the number of 
left-to-right maximal elements in x. For a random permutation of x with \x \ = N elements, 

N 1 

E[m x ] = E -* lnlV as N —> oo. (5.2) 

i=l N 

Due to symmetry, the same result holds for minima and right-to-left extrema. 

Proposition 5.2.4. For a block of size N, the expected query complexity of Algorithm 3 is O(logiV) 
in a maximal breakpoint array. 

Proof. Since the breakpoint array is maximal, following pointers in p to find the block end creates 
a sequence of left-to-right maxima. For a block of size N, starting at t, there are M := N — 2 
elements in / :=[t + l, ...,t+N — 1] which can appear in any order, and the claim follows from 
Eq. (5.2). □ 

Proposition 5.2.5. Assume all elements in b have different values. Then the maximum expected 
stack size in Algorithm 2 is at most In T. 

Proof. Assume m = oo. An element at t is pushed whenever there exists an index j on the stack 
such that Vi = /,..., top : w [i] < w [t]. Given the smallest such j, the stacks are popped until 
top = j — 1, and w[j — 1] > w[t]. Therefore, the stack contains the right-to-left minima of 
w[l:t] after pushing index t, and the claim follows from Eq. (5.2) for t = T. For any m < oo, 
the front of the deque gets popped, thus only decreasing the stack size. □ 
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For T hg , the expected maximum stack size is < 22, a negligible overhead. We noticed that, for 
noisy data, most entries in p are much smaller than T, and using pointer-sized integers such as 
size_t in C+ + (typically 8 byte on 64-bit systems), would be wasteful. Instead, we use a 2-byte 
unsigned integer type to accommodate jumps up to m = 65,536. The resulting breakpoint array 
is not maximal anymore, but maintains its space-efficiency and compressivity, as Algorithm 3 
does not require maximality. The query overhead is minimal in practice; even in case of a single 
block for genome sized data, < 54. 

Proposition 5.2.6. Using a breakpoint array to iterate over block boundaries, which are then used 
to query sufficient statistics from an integral array yields a compressive and space-efficient block 
generator. 

Proof To prove that the breakpoint array is indeed a valid iterator over block boundaries, 
Algorithm 3 provides a constructive description of how a block is created for an arbitrary threshold 
A. Its correctness is shown by induction: Assume that the start position S; is a valid block boundary, 
so b[s,-] > A or s, = 0. The block is determined by finding the next start position s i+1 , which can 
be found by setting s i+1 := s ; + 1, and incrementing until b[s i+1 ] > A. Instead of incrementing by 
1, Eq. (5.1) guarantees we can increment s; +1 byp[s i+1 ] without missing a breakpoint (lines 4-9). 
The block size N = s l+l — s ; is then easily calculated (line 10). Since no superfluous breakpoint 
is introduced, the data structure is compressive, provided that the sufficient statistics can be 
generated correctly. Since the first block starts at s 0 = 0, the claim follow by induction. As the 
number of sufficient statistics in a breakpoint array is T, it is also space-efficient. □ 

On first sight, the query complexity of O(logiV) appears to be suboptimal compared to 
the constant-time queries in the original wavelet tree. This, however, is misleading. We have 
shown that the wavelet tree implements a suboptimal compression in the sense that it introduces 
additional block boundaries which do not correspond to discontinuities in the Haar transform. 
For a true block of size N, it creates on the order of In N different blocks. Hence, not only does 
it take In IV such queries to cover the entire block, it also increases the size of the trellis and 
hence both space and runtime requirements. The breakpoint array thus has the same amortized 
complexity as the wavelet tree, while creating better compression. 
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5.2.2.1 Haar breakpoint weights 

Having established a data structure to iterate over blocks for any given compression level, we now 
define a vector b H of breakpoint weights for the Haar wavelet transform, and derive a simple 
algorithm to calculate it from y. We need the following definitions: 

Definition 5.2.3 (Maxlet arrays). For b± k e [0, T), let 

' 

oo t = 0 V b~, > T 

J(V»j,fc,y)| t > 0 V bj k < T 
he a vector of absolute Haar wavelet coefficients, called a univariate maxlet array. For multivariate 
data, the multivariate maxlet array is the pointwise maximum of the univariate maxlet arrays for 
each data dimension. 

Notice that the univariate maxlet array corresponds to the absolute values of Haar wavelet 
transforms, generalized for arbitrary T: If T is a power of 2, it contains the absolute values 
of detail coefficient of the Haar wavelet transform with its first entry (the scaling coefficient) 
replaced by oo. For other T, it consists of a concatenation of such segments of decreasing size. 
For instance, if T = 22, the transform concatenates 3 arrays whose sizes are powers of two 
(22 = 16 + 4 + 2). Later in this chapter, we provide two algorithms: 

1. The univariate maxlet transform computes the maxlet array for arbitrary data sizes T 
in-place and in linear time. In Fig. 5.5, the dashed lines and circular markers in the 
upper subplot illustrate the necessary updates, and the white markers in the lower subplot 
correspond to the result. 

2. The multivariate maxlet transform computes the multivariate maxlet array; instead of 
computing d univariate maxlet transforms separately, using a space of 2 T (one array of size 
T for calculating the transform at the current dimension, another for storing the maxima), 
we show that this transform can be computed using only T + d Id T space and O(dT) time. 
For d = 1, the result is the same as the univariate maxlet transform. 

In a maxlet array, each value at t corresponds to the (maximum of d) detail coefficients for 
which the central discontinuity b± k = t. In order to decide whether or not a wavelet regression 
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under a threshold A has a discontinuity at t, all wavelets with bj k = t and b . fc = t have to be 
taken into account. The breakpoint weights are thus computed as follows: 


Definition 5.2.4 (Haar breakpoint array). For b± k € [0, T ), let 

r -i I 00 

K[bt k ] = 


t = 0 V b~ >T 

J,k 


max 


f,k' {}| b tk = b r,k ' v b t,k = b y,k'} f > 0 v b j,k < T 


be a vector of breakpoint weights. A breakpoint array initialized with these weights is called a Haar 
breakpoint array. 


Here, the entry at each position t is set to the largest absolute coefficient for all wavelets 
which have their left, central or right discontinuity at t, as long as the wavelet for which b± k = t 
has full support in [0, T). The reasoning behind this definition is that if T is a power of 2, then 
this breakpoint array introduces the same breakpoints as wavelet compression for the same 
threshold, with probability 1; a breakpoint is only potentially hidden in cases where b~ k = bt fc+1 ; 
on M, this has probability measure 0. For other data sizes, there are wavelets which have their 
central discontinuity within the data, but have incomplete support. Instead of using one of several 
padding schemes, we assume those breakpoints to have infinite weight, which introduces up to 
Id T additional breakpoints. 

If the emissions are multivariate with dimension d, the set of block boundaries is the union 
of block boundaries across all dimensions. In other words, the corresponding weights for the 
Haar breakpoint array can easily be calculated as the per-position maxima of across the d such 
weight vectors, as given by the multivariate maxlet array. 

To derive Haar breakpoint weight from any maxlet transform, we introduce the Haar boundary 
transform, which performs the necessary maximum computations in-place and in linear time 
(XT’). In Fig. 5.5, the solid lines in the upper subplot illustrate the computation of maxima, and 
the black markers in the lower subplot correspond to the result. 


5.2.2.2 In-place univariate maxlet transform 

In-place calculation of b M requires an in-place generalization of the Haar wavelet transform 
for arbitrary-size data. The pyramid algorithm used in our original approach was obviously not 
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in-place, as it required padding of the input array to a power of two, as well as an additional 
array of size T for temporary storage. In order to derive an in-place algorithm to calculate b M , 
we use a more recent in-place calculation of the Haar wavelet transform based on the lifting 
scheme (Algorithm 4), due to Sweldens (1995, 1998). It is based on the following recursions: 


c i,k ■= i 


y[k] 

,b~,~ i 

’m 


j = 0 


S t f fa+ y[t] = Z t i + y[t] + S t i^ y[t] = C j-l,2k + c j-l,2k+l J > 0 

^ i-k j,k j,k 

f h %~ 1 


■*-j,k * 


V2J 


u . 1 x b.,— 1 


S S rtf] 


t=b ± , 

J,k 


V2J 


d c j- 


1,2 k 


+ c ;-1.2fc+l) 


These relations are illustrated in Fig. 5.5 using dotted edges, with dj k = Wj k and c o fc = = y [fc]. 
By storing Cj^ at index bt fc and dj ^ at index , they derive an in-place algorithm which never 
overwrites d } k once it is calculated (Algorithm 4). Notice that each detail coefficient d ; - is 
stored at the position b± k corresponding to the central discontinuity in their corresponding 
wavelet, and that this corresponds to an in-order DFS layout of the wavelet tree without the 
leaves corresponding to the input data, with the leftmost leaf at index 1 (Fig. 5.5, bold lines); 
the tree is created from the leaves up, and from left to right. 


Algorithm 4 In-place Haar wavelet transform for power-of-2 data sizes (Sweldens 1995, modi¬ 
fied for expositional purposes). 


1 

procedure HAARTRANSFORM(y) 


2 

t «- lyl 

> number of data points 

3 

for j <— 1, ...,ldT do 

> iterate over levels, bottom-up 

4 

N ^ V 

> support size of ip 

5 

c $ — 

Tiv 

> normalization constant 

6 

for k <— 0,..., ^ — 1 do 

> process elements on level j from left to right 

7 

L <— Nk 

> left index 

8 

H <— iV(fc + j) 

> right index 

9 

y[L]<-y[L] 

> copy Cj_i 2 k 

10 

y[R] «- y M 

> copy Cj_ 12k+1 

11 

y[L]^y[L] + y[R] 

> calculate Cj k 

12 

y[R]^s(y[L]-y[R]) 

> calculate dj k 

13 

return y 



We first provide a generalization of the in-place Haar wavelet transform to arbitrary data 
sizes (Algorithm 7). 


Proposition 5.2.7. b M can be computed in-place and in linear time O(T). 
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Algorithm 5 Given emission data y, compute the univariate maxlet transform, i. e. the absolute 
detail coefficients of the Haar wavelet transform for arbitrary data sizes T. 


1 

procedure MAXLETTRANSFORM(y) 


2 

T <- |y| 

> number of data points 

3 

for j <— 1, ..., 1" Id T] do 

> iterate over levels, bottom-up 

4 

N<- 2> 

> support size of if j k 

5 

c J_ 

-In 

> normalization constant 

6 

for k<- — 1 do 

> process elements on level j from left to right 

7 

L <— Nk 

> left index 

8 

R <— JV(k + |) 

> right index 

9 

if R < T then 


10 

y[L]^y[L] 

> copy Cj_ h2k 

11 

y[k] <— y [R] 

> copy Cj_ 12k+1 

12 

y[L]^y[L]+y[R] 

> calculate Cj ^ 

13 

y[R] <-s|y[L]-y[R]| 

> calculate | dj k | 

14 

else 


15 

y[L] «- oo 

> force breakpoint for incomplete support 

16 

return y 



Proof. b M can be considered as the concatenation of a minimum number of maxlet transforms 
b p M of decreasing sizes which are all powers of 2. Note that b M [t] = oo whenever b p M [ 0]. Since 
the support size for any Haar wavelet is a power of two, an infinite value indicates that the value 
associated with this position has incomplete support across the data. Assume w.l.o.g. that T is a 
power of two and hence the tree is complete. The Haar wavelet transform can be implemented 
in-place in linear time (Sweldens 1995), yielding a DFS in-order layout of the wavelet tree, with 
the first entry representing the sum of all data points. The first element can be set to oo and the 
other values can be set to their absolute value in a second pass. Alternatively, the transform can 
be calculated for arbitrary data sizes in one pass (Algorithm 7, which is a modification of the 
lifting scheme with checks for incomplete support intervals). □ 


5.2.2.3 Multivariate maxlet transform 

Naively, the maxlet transform for multivariate data can be computed using 2 T space, by succes¬ 
sively computing the univariate maxlet transform for each dimension d in an array of size T, 
and continuously updating the maxima across dimensions in a second array of the same size. 
Here, we derive an alternative version requiring only an array of size T as well as a stack of size 
0(d Id T). Essentially, it is an adaptation of the lifting scheme, which changes the order of com- 





100 


putation (Algorithm 6). For an inner node in the tree of detail coefficients to be computed, all its 
descendants have to be known. In a streaming setting, the leaves arrive ordered by t. Therefore, 
the lifting scheme has to be changed from bottom-up BFS to DFS post-order, necessitating the 
use of a stack. 

The algorithm maintains the following invariant: when processing the detail coefficient dj k at 
position p, the stack S contains the unnormalized scaling coefficients (i. e. sums of data segments 
of size 2 J ) Cj_ 1)2 k+i for all dimensions at the top D elements, followed by the set of unnormalized 
scaling coefficients for c J -_ lj2 fc+i at the D positions below. Using a stack that allows random access, 
such as the vector class in C+ + , detail coefficients the maximum dj k across all data dimensions 
can be computed from the 2D top elements of the stack, using the lifting recursions and the 
level-specific normalization factor n (Algorithm 6, Line 16-Line 25). Afterwards, by adding the 
top D entries in the stack to the next D entries element-wise and popping the top D elements, 
the stack contains the dimension-wise sums for the joint support ranges, i. e. the unnormalized 
scaling coefficients to compute the left parent of the current node (the parent index and checks 
whether such a parent exist for the current position is maintained in the index p and a bitmask 
m reflecting the structure of the tree). For instance, if the stack contains c 22 and c 2 3 , replacing 
c 2,2 W c 2,2 + c 2,3 an d popping c 2; 3 yields c 31 on top of the stack, so d 4 0 can be calculated, since 
the elements below the stack top are the scaling coefficients c 3 0 , see Fig. 5.5. Since the behavior 
of the stack mimics that of a DFS traversal, it never gets larger than d Id T. 

5.2.2.4 Haar boundary transform 

For each central discontinuity bj k , there are j — 1 wavelets at lower levels j' < j, which have 
their left discontinuity at that position (f>+ fc , = bt fc ). The same is true for right discontinuities, 
so that each position is the maximum of 2 j — 1 absolute wavelet coefficients. In Fig. 5.5, these 
relations are indicated in that each node in the tree is transformed into the maximum of itself as 
well as direct descendants in lower levels indicated by solid lines (including tree edges). 

We will later show that despite those dependencies, the Haar breakpoint array can be 
calculated in-place in linear time for arbitrary-sized input data y. 

The linear running time for maximum calculation can be established non-constructively: 
The central discontinuity of a wavelet at level j is shared with j wavelets for which it is the left 
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Algorithm 6 Computes the univariate or multivariate maxlet transform, i. e. for each position t 
and D Haar wavelet transforms of size T the maximum of the D absolute wavelet coefficients at 
position t is computed. The input is assumed to come from a data stream F, sorted by (t,d), 
0<t<T,0<d<D. The algorithm requires T + D Id T space and time. 

l 

procedure StreamingMaxletTransform(F, D) 

2 

Allocate empty vector b 

> Contains resulting maxlet transform 

3 

Allocate empty stack S 

> Contains unprocessed data sums 

4 

t <- 0 

> Current data index 

5 

d <—0 

> Current dimension index at position t 

6 

while F not empty do 


7 

Get next element from F and push it to S 

8 

d + + 

> Dimension index of next element 

9 

if d = D then 

> Finished reading all dimensions for this position 

10 

d 0 


11 

Append oo to b 

> Coefficients are oo for incomplete support 

12 

P*~t 

> Index of detail coefficient in upward-left path (DFS) 

13 

m <— 1 

> Bit mask for accessing left parent node, if it exists 

14 


> Wavelet normalizer 

15 

while (p © m) > 0 do 

> p is a left parent in tree 

16 

c <— 0 

> Maximum detail coefficient across dimensions 

17 

L «- \S\-2D 

> Lower stack index 

18 

R^L+D 

> Upper stack index 

19 

for D iterations do 


20 

c <— max{c, n|S[L] - 

- S[R] } > Update maximum at j 

21 

S[L] ^ S[L] + S[R] 

> Store sum for next level 

22 

L H — K 

> Next dimension for lower stack elements 

23 

R + + 

> Next dimension for upper stack elements 

24 

b[p] <— c 

> Store maximum absolute detail coefficient 

25 

n ^T2 

> Update normalizer for next level 

26 

Pop D elements from S 

> Top now contains subtree sum 

27 

p <— p — m 

> Move to potential left parent 

28 

m <— 2m 

> Shift bit mask 

29 

t + + 



discontinuity, and likewise for the right, hence each wavelet coefficient at level j needs to be 
updated by 2 j coefficients at lower levels. There are coefficients at level j, hence the total 
number of updates is 



is monotonically increasing. 
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Algorithm 7 Given a maxlet transform d, for each position t compute the maximum absolute 
coefficient of all wavelets which have a discontinuity at t, in-place and in linear time. 

l 

procedure HAARBOUNDARYTRANSFORM(d) 


2 

d[0] <— oo 

> force breakpoint before first element 

3 

for j <— [ Id T J,..., 1 do 

> iterate over levels, top-down 

4 

N *- 2i 

t> support size of wavelet t level j 

5 

for 0, — 1 do 

> process elements on level j from left to right 

6 

t <-N(k+ \) 

> index of central discontinuity bp k of ip j k 

7 


> distance to left and right discontinuity 

8 

if t < T then 


9 

L <— t — n 

> index of left discontinuity 

10 

d[L] <— max{d[L],d[t]} 

> consider left discontinuity 

11 

R^- t + n 

> index of right discontinuity 

12 

if R < T then 


13 

d[R] <— max{d[R],d[t]} 

> consider right discontinuity 

14 

return d 



At each position, left and right discontinuities of multiple wavelets have to be taken into 
account to calculate the maximum, except for positions between trees, which are already oo. 
Two versions of this approach can be used: one can traverse through the tree in a BFS fashion, 
and update bp k from the bp fc , and b l ( , k , at lower levels l' < l. Alternatively, the same updates 
can be performed traversing top-down through levels l ', updating weights at higher levels for 
left and write discontinuities. Each wavelet is considered at most twice for updating a wavelet 
on a higher level. Since higher-level weights are updated from lower levels, and the traversal is 
top-down, no weights are overwritten prematurely and the algorithm is in-place. 


5.3 Compressed marginal records 

Definition 5.3.1 (Marginal records). Let t e [0,..., T), s max the largest state sampled during FBG, 
and s G [0,... ,s max ]. A marginal record is a data structure which allows to store and query the 
number of times state s was observed at data index t. 

Our previous solution to recording marginal state counts was closely tied to the wavelet tree 
data structure in that it stored count vectors in the nodes of the tree. This was inefficient for 
a number of reasons: firstly, as discussed earlier, the number of nodes is larger than necessary. 
Secondly, the memory for each of these nodes has to be allocated. If the tree is implemented 
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as a flat array of size 2 T, the allocation requirements for k states are 277c, even though a lot of 
nodes will potentially not contain any counts, and even those that do will contain zeros for many 
states. Such a preallocation approach also requires the number of states to be known in advance, 
and precludes further extensions to priors on the state number such as the Dirichlet Process. 

We had therefore opted for a second approach, in which counts are dynamically allocated 
in each wavelet tree node using C + + vectors, such that the i-th position contains the count 
of state i. Vectors were dynamically increased in size to accommodate counts for the highest 
numbered state in each iteration. However, even with this approach, there was considerable 
overhead for enabling such dynamic allocation, in this case 20 B per node for holding start, end, 
and reserved memory size for vectors. For a billion data points, this alone results in 40 GB of 
additional RAM. 

This could be somewhat alleviated by using a run-length encoding (RLE) approach, in which 
state counts are recorded for each compression segment and stored along with the segment 
length, illustrated by the right column of Fig. 5.6. 

Dynamic compression however complicates the use of run-length encoding for marginals. At 
each new iteration, a different block structure is created, which requires existing RLE segments 
to be split into multiple parts, each of which will have counts for a different state added. This 
could be solved trivially using a linked list implementation, in which new segments are inserted 
with the appropriate updates of its neighbors size. This approach however has two disadvantages. 
Firstly, maintaining the pointers in the list is wasteful in terms of memory. Second, in order 
to hold the state count record, either a fixed array needs to be allocated, which is wasteful if 
the number of states is large, as it will contain mostly zero-entries, or some kind of dynamic 
data structure which encodes a compressed version of the state counts is required, which incurs 
additional memory due to additional householding variables. On the other hand, states could 
be stored in a resizable array like C++’s vector. It could be increased in size to hold the new 
number of run-length segments, and move existing items so as to free the necessary insertion 
positions. While this would only incur linear-time overhead, it requires the insert positions to 
be known with random-access. However, we do not explicitly store the block sizes for a state 
sample, as this incurs large memory usage, and obtain them by iterating over the breakpoint 
array instead. In this case, every entry after a new segment position would have to be moved 
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Figure 5.6: A small three-step example of recording marginal counts using Algorithm 8. Starting 
at position t = 0, 7 observations of state 5 are inserted. In the count queue, black boxes indicate 
that state counts of zero have been skipped; those numbers encode the next higher state that has a 
non-zero count. White boxes indicate the counts for the state. For instance, the right-most part of 
the count queue in the top subfigure is stored as (0,-1,—2,4,—7), indicating that there is 1 count 
for state 0, 2 counts for state 1, and 7 counts for state 4. The segment starts at position t = 9, and 
has a length of 1. Note that 0 is used to mark the start of a new segment. Each segment has a total 
of 10 counts already recorded. Arrows indicate contiguous elements in the count queue. With every 
iteration, a segment is moved to the back with the new state count included. Note that in the last 
iteration, the segment t = 6,..., 8 is split. After finishing this step, the next count would be recorded 
starting at position t = 7. Notice how each run of zeros in the state queue is represented by a single 
number, thus allowing for arbitrarily large state indices without much overhead. 
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further back in the vector, so that the reallocation complexity for vector size n is on the order 
of 0(n 2 ) in each FBG iteration. Furthermore, this approach requires the marginals to be stored 
as individual data structures with the same memory overhead described above. If, on the other 
hand, the encoding was sequential so as to fit into a single array, determining the positions of the 
segments to be moved is challenging. 

We developed an encoding for marginal records that stores counts sequentially in a vector 
of integers in a highly compressed fashion with small overhead. Adding records for run-length 
encoded state sequences is performed using a queue with iterator access to its front elements, 
such as implemented by deque, and requires a single pass over the state records and is therefore 
linear. The memory overhead is 2 bytes per segment, plus one bit for every 32 integers. 

Encoding for marginal counts for a single position is performed using a sequence c of signed 
integers. A negative number is used to store the counts for a state. The state s(i) of a position i 
is recursively defined as 

5(0) = 0, 

r 

s(i — 1) c[t — 1] < 0 

5(0 := 

c[t — 1] c[t — 1] > 0 

Positive entries are called index values. We further require that all index values must be in strictly 
increasing order, and that no unnecessary index is used, i. e. we require 

Vc[i] > 0 : s(i — 1) + 1 < c[t]. 

In other words, runs of states having observed counts are represented as runs of negative numbers, 
and runs of zero-counts are represented as a single number indicating the state label of the next 
higher state with non-zero counts. For instance, the count vector 

( 2 , 0 , 0 , 8 , 1 , 4 , 0 , 0 , 0 , 0 , 5 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 ) 

would be encoded as 

(-2,3,-8,-1,-4,9,-5), 

and the corresponding states are 


( 0 , 1 , 3 , 4 , 5 , 6 , 9 ), 
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Iteration 


400 600 

Iteration 


1000 


Figure 5.7: Micro- (left) and macro-averaged F-measures for the improved implementation of HaMM- 
LET, using integral array, breakpoint array and the streaming maxlet transform. For comparison with 
the previous implementation results, see Fig. 4.10 on page 80. 


though 1 and 6 are somewhat inconsequential as they have no counts associated with them; note 
that the decision to use negative signs for counts instead of index values is arbitrary in principle, 
but leads to using fewer negations in the implementation. In settings where quick convergence 
is expected, the number of zeros is expected to be high, leading to good compression under 
this scheme. In general, assume that the marginals contain M distinct segments after running 
FBG, and the HMM has S states. Then, the queue can contain no more than (2 S + 1 )M entries: 
for each segment, one zero to mark the beginning of a segment, and up to one positive and 
negative value per state. If the number of latent HMM states is limited to S, then there can be 
no more than S non-zero entries per segment. Hence, for reasonably high compression ratios, 
this amounts to small memory usage. For instance, at a compression ratio of 300 for a human 
genome at base-level resolution and 10 latent HMM states, marginal records using 2-byte signed 
integers require less than 234 MB. In practice, not every segment will contain 11 values, due to 
fast convergence, and the numbers get even smaller. Compared to the storage requirements of 
the block generator, this is negligible. 
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5.4 Evaluation 

Changing the compression scheme raises the question of its effects on the inference quality. In 
particular, a stronger compression allowing for fewer path changes might have two competing 
effects. On the one hand, the decreased freedom in sampling the state sequence might result 
in slower mixing. On the other hand, assigning a state other than the MPM state to a block 
might yield posteriors which could prolong the convergence of said state parameters to their 
MPM value. In order to evaluate which of these effects is stronger, we rerun our previous 
simulations, using the exact same files created earlier and deposited on Zenodo. Fig. 5.7 shows 
the micro- and macro-averaged F-measures for our new implementation. Results are virtually 
identical, thought the macro-averaged F-measure appears to converge slightly slower for the lower 
quantiles. The slower mixing effect appears to be slightly stronger, yet negligible. Benchmarking 
results comparing the prototype and improved implementations of HaMMLET can be found in 
Section 6.4 



108 


Algorithm 8 Append N observations of state s 

to the marginal records. 

1: 

procedure AddMarginalRecord(N, state) 


2: 

bool done <— false 


3: 

s <-0 

> State at current index 

4: 

while IV > 0 do 

> Consume entire length of insert block 

5: 

s <— 0 

> Assume lowest state 

6: 

done <— false 

> Remember if insert was successful 

7: 

for t *— 0 t < couNTQ.size() ++t do 


8: 

ENTRY <— COUNTQ [t] 

> Entry at current position 

9: 

if entry = 0 then 

> All entries for this segment have been consumed 

10 

if -i done then 

> Count not registered yet 

11 

if s < state then 

> Skipped over states? 

12 

COUNTQ.pUSh( STATE ) 

> Add state index 

13 

COUNTQ.pUShf —COUNT ) 

> Append count as negative number 

14 

COUNTQ.pUShf 0 ) 

> Append 0 to mark end of this segment 

15 

break 


16 

if done then 

> Count was successfully registered 

17 

COUNTQ.pUShf ENTRY ) 

> Keep appending all entries for this segment 

18 

continue 


19 

if entry > 0 then 

> Entry denotes state of next entry 

20 

if state < entry then 

> Count must be inserted here 

21 

if s < state then 

> Skipped over states? 

22 

COUNTQ.push( STATE ) 

> Add state index 

23 

COUNTQ.pUshf -COUNT ) 

> Append count as negative number 

24 

if state + 1 < entry then 

> Skipping states until next count? 

25 

COUNTQ.push( ENTRY ) 

> Add state index 

26 

done <— true 

> Mark count insertion as successful 

27 

else 


28 

COUNTQ.pUShf ENTRY ) 


29 

S <— ENTRY 

> Update state for next entry 

30 

else 

> Entry is negative count or current state 

31 

if s = state then 

> Reached target state to be counted 

32 

COUNTQ.pUShf ENTRY — COUNT ) 

> Add count 

33 

done <— true 

> Mark count insertion as successful 

34 

else 


35 

COUNTQ.pUShf ENTRY ) 

> Append entry without further action 

36 

S+ + 

> Update current state 

37 

if N < size Q.frontQ then 

> Residual front segment remains 

38 

sizEQ.pushf N) 

> Assign its size to new segment 

39 

sizEQ.frontf) <— size Q.frontQ — N 

> Decrease size to remainder 

40 

break 


41 

else 

> Front segment was completely absorbed 

42 

sizEQ.pushf sizEQ.front() ) 

> Associate its size with new segment 

43 

N <— N- sizEQ.front() 

> Set insert size to its remainder 

44 

sizEQ.popO 

> Remove empty segment’s size 

45 

while count Q.frontQ ^ 0 do 

> Remove all entries for mpty segment 

46 

countQ. pop () 


47 

countQ. pop() 

> Remove segment separator 
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Chapter 6 

Application to WGS data 


In this chapter, we demonstrate the applicability of HaMMLET to a real-world research question. 
We show that CNV candidates inferred using our methods are enriched for gene annotations 
consistent with a working hypothesis that CNVs play a role in the domestication syndrome. 

6.1 CNV as a genetic basis for domestication effects in rats 

The domestication of a handful of animal species, starting in the early Holocene, has played a 
crucial role in the development of complex human societies (Diamond 1998). While we have 
learned a great deal about when and where animal domestication occurred, the genetic changes 
that underlie the phenotypic differences between domestic animals and their wild progenitors 
remain relatively unknown. It has been observed that domestic animal species tend to share a 
suite of behavioral, physiological and morphological traits that are absent or rarely observed in 
their wild progenitors (Darwin 1868; Wilkins, Wrangham & Fitch 2014). These traits include 
changes in pigmentation, craniofacial anatomy, hormonal levels, seasonal reproduction cycles 
and increased docility (Sanchez-Villagra, Geiger & Schneider 2016). This suite of changes is 
referred to as the “domestication syndrome”. A long-standing question in evolutionary biology is 
whether these convergent changes are the result of genetic drift, artificial selection by humans for 
each individual trait, or pleiotropic effects of selection for a few or even a single trait. A proponent 
of the pleiotropy hypothesis, i.e. that genes under selection for behavioral traits of tameness 
and aggression influence other, seemingly unrelated phenotypes, was the Academician Dmitry K. 
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Belyaev. He hypothesized that selection for tameness at the start of the domestication process 
had pleiotropic effects that explained many of the features of the domestication syndrome. To 
test his hypothesis, he began a program of experimental domestication of the silver fox (Vulpes 
vulpes) in 1959 in Novosibirsk, Siberia. Foxes obtained for fur farms were selectively bred for 
their behavioral response to an approaching human. One line of foxes was bred for tame behavior 
towards humans, while a control line was selected for a fearful and aggressive response towards 
humans, to maintain the wild-type behavior despite being maintained in captive conditions. 
After just a few generations of selective breeding, the tame line began to show many of the traits 
associated with the domestication syndrome, including changes in pigmentation, morphology 
and behavior (Belyaev 1969; Trut, Plyusnina & Oskina 2004; Trut, Oskina & Kharlamova 
2009). 

The same experimental setup of artificially selecting two lines, one for tame and one for 
fearful and aggressive behavior towards humans, was also repeated by the same research group 
in the brown Norway rat (Rattus norvegicus) with similar results (Albert et al. 2008). These 
results seem to confirm Belyaev’s hypothesis that selection for tameness alone could explain 
many of the features of the domestication syndrome. However, the specific genetic changes that 
underlie these changes remain unknown. Knowledge of the genetic variants that have been 
selected in these lines could lead to mechanistic insights into the domestication process. Genomic 
structural variants are of particular interest, as they are known to have played a role in the 
adaptation of other domestic animals (Axelsson et al. 2013), and structural variants that affect 
multiple functional genomic loci are one possible explanation for the rapid response to selection 
observed in these lines. To address this issue we analyzed whole-genome data that was generated 
from multiple individuals from the tame and aggressive lines of rats. For an overview of the 
experimental process described below, see Fig. 6.1. 

6.2 Sample origins and data generation 

DNA samples were obtained from two rat lines originating from a shared wild source population, 
subsequently maintained in isolation and divergently selected for ~70 generations for their 
behavioral response to humans. 20 samples were obtained from the tame line, which has been 
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selected for a reduced fear response towards an approaching human hand. 20 samples were 
obtained from the aggressive line, which has been selected for an increase in fearful and aggressive 
behavior towards an approaching human hand. DNA extraction was carried out at the Institute 
of Cytology and Genetics, the Siberian Branch of the Russian Academy of Sciences, Novosibirsk 
and at the Max Planck Institute for Evolutionary Anthropology (MPI-EVA), Germany. 

For all samples, sequencing libraries were generated consisting of 125 bp double-indexed 
paired-end reads. Samples were pooled into a single library in order to avoid any batch effects 
during sequencing. Sequencing was performed on a combination of the Illumina Genome Analyzer 
II and High-Seq platforms. Library preparation and sequencing was carried out at the MPI-EVA. 
The rats have a mean coverage of ~4X per individual, meaning that each genomic position is 
expected to be covered by 4 reads. Base calling was done using freelbis (Renaud, Kircher, et al. 
2013). Adapters were removed and potentially chimeric sequences flagged using the software 
leeHom using default parameters (Renaud, Stenzel & Kelso 2014). Reads were demultiplexed 
using deML using default quality thresholds (Renaud, Stenzel, Maricic, et al. 2015). Reads 
were then mapped to the Rattus norvegicus reference assembly rno5, using the Burrows-Wheeler 
Aligner (BWA) with default parameters (Li & Durbin 2009). Duplicate read removal was 
performed with Picard (http://broadinstitute.github.io/picard/). Local indel realignment was 
performed using the Genome Analysis Toolkit (GATK) (McKenna et al. 2010; DePristo et al. 
2011; Van der Auwera et al. 2013). 

6.3 Data prepocessing 

We used SAMtools to process the BAM files resulting from the read mapping procedure. Lowest 
mapping positions were recorded for each read, and their counts were accumulated. Start counts 
for the tame population were subtracted from their counterparts in the aggressive population, 
yielding 1,880,703,547 data points. Note that, due to the multiplexed sequencing, any additive 
bias cancels out during subtraction. We used lowest mapped positions per read instead of per-base 
coverage (pileup data) for two reasons. Hidden Markov Models assume conditional independence 
of the observed data points given the state sequence. However, using pileup creates statistical 
dependence between neighboring positions due to reads covering multiple bases. Secondly, 
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pileup data tends to produce smooth curves. Since the Haar wavelet only has one vanishing 
moment, the local curvature of the data tends to produce large coefficients which severely reduces 
the compression. Additionally, sufficient statistics of blocks that are being created are hard to 
interpret, and the noise violates the Gaussian assumption. Instead, using only one position for 
each read decorrelates neighboring data points. 

Due to the low coverage and the integer nature of the counts, the data showed highly 
discrete noise, and hence the data was averaged over non-overlapping windows of 20 positions 
to approximate Gaussian noise, resulting in 94,035,178 input positions. We then ran HaMMLET 
with 8 CNV states and automatic priors, see Wiedenhoeft, Brugel & Schliep 2016c. An 
example plot from this study showing complex CNV can be found in Fig. 6.2. 

6.4 Benchmarks 

On a computer with Intel Xeon CPU E7-8890 v4 (2.20 GHz) and 1TB RAM, running Ubuntu 
14.04.5 LTS, full Bayesian inference with HaMMLET for 200 iterations with a burn-in of 1,800 for 
an 8-state-model required 3 min 41 s and 1.3 GB RAM. By comparison, the previously published 
version of HaMMLET took 1 h 5 min 27 s, using 40 GB RAM (cf. Figure 6.3). 

For a broader evaluation, we have created 100 replicates of the data by splitting it into 2500 
chunks of equal sizes, which we then permuted randomly. We measured the memory usage 
(maximum resident set size), running time as well as cache behavior (minor page faults), see 
Fig. 6.4. The smaller savings in runtime compared to the original data can be attributed to the 
fact that permutation of the data is likely to disrupt long, highly compressible sections of the 
data. Also, with smaller blocks created under optimal compression, the undercompression effects 
in the wavelet tree might be less pronounced, hence its lower average runtime compared to the 
original data. 

While the RAM usage remains almost constant among replicates within each implementa¬ 
tion, we noticed that runtime and cache behavior varied widely in the old, but not the new 
implementation. We attribute this to the fact that the old compression scheme is suboptimal, 
yielding smaller blocks and hence more randomized assignment to states, leading to slower 
mixing properties of the Gibbs sampler. The data wavelet tree data contains outliers not shown 
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Figure 6.3: Comparison of benchmarks for running time, memory usage and cache behavior between 
the old and new versions of HaMMLET on the rat population WGS data set. The new approach yields 
a 17.8-fold speedup and 32.2-fold memory reduction. Notice that the number of minor page faults 
decreases by five orders of magnitude, indicating much better cache behavior due to the use of new 
data structures and an improved implementation. The number of major page faults is zero in both 
implementations. 
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Figure 6.4: Comparison of benchmarks for running time, memory usage and cache behavior between 
the old and new versions of HaMMLET on 100 permutations of the rat population WGS data set. 
Notice the decrease in variance for both page faults and running time. 


in the figure, most notably a runtime instance of 6.4 h, which is likely to result from sampling 
small emission variances due to short compression blocks. 


6.5 Results 

We consider all genomic segments with an absolute state mean > 1 as containing putative 
structural variation segregating between the tame and aggressive rat lines. This results in 
10,083,374 regions with a mean size of 407 base pairs. We identify all genes that are within or 
overlap these regions by > 1 base pair using Ensembl’s Variant Effect Predictor (McLare et al. 
2016). We find 1,036 genes with at least partial overlap with these regions. 

To investigate the potential phenotypic consequences of these structural variants we performed 
a gene enrichment analysis using the software Webgestalt (Zhang, Kirov & Snoddy 2005; Wang, 
Duncan, et al. 2013). We tested for enrichment of gene ontology (GO term) categories for all 
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genes overlapping these structural variants using all genes in the rat genome as background. We 
consider as significantly enriched all pathways with p-value <0.05 after using the Benjamini and 
Hochberg procedure to correct for multiple hypothesis testing (Benjamini & Hochberg 1995). 
We identify many significantly enriched pathways, see Appendix A. We now briefly discuss some 
of these pathways and the genes within them and how they may inform us about the genetic 
changes underlying the phenotypic differences between these lines. 

The most significantly enriched pathway is “Synapse assembly” (p-value = 0.0028), with 
five genes that are in putative structural variants segregating between the tame and aggressive 
rat lines. Some of these genes are associated with phenotypes that may be involved in the 
behavioral differences observed between the tame and aggressive rat lines. For example, one of 
the genes is the neuronal cadherin gene Cdh2. Missense mutations in this gene are associated 
with obsessive-compulsive behavior and Tourette disorder phenotypes in humans (Moya et al. 
2013) and this gene has been associated with anxiety in mice (Donner et al. 2008). Another 
gene encodes the ephrin receptor Ephbl. The ephrin receptor-ligand system is involved in the 
regulation of several developmental processes in the nervous system. Notably, mice with null 
mutations for this gene exhibit neuronal loss in the substantia nigra and display spontaneous 
locomotor hyperactivity (Richards et al. 2007). This is interesting given that the tame and 
aggressive rats have differences in their activity in an open-field test (Albert et al. 2008). 

We also observe multiple additional enriched pathways involved in neuronal development 
and function, e.g. “transmission of nerve impulse”, “regulation of neurological system process”, 
“dendrite morphogenesis”. Therefore, we suspect that many of these segregating structural 
variants may have been targeted by selection and are contributing the phenotypic differences 
between these lines. Pending experimental validation, future study of the variants identified 
here may lead to insights into the domestication process. 
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Chapter 7 

Conclusion 


In this thesis, we have presented a novel approach for HMM-based Bayesian segmentation of 
large-scale data for the inference of genomic copy-number variants. Using a decision-theoretic 
framework, we have shown that for homoscedastic Gaussian emissions the discontinuities in a 
Haar wavelet regression based on the universal threshold capture strong changes in the posterior 
state distribution. This allows for a compression of the data into blocks of sufficient statistics within 
which the data is likely to be emitted from the same latent state. We derived the bias incurred 
in the forward-variables by ignoring between-state transitions for general HMM, generalizing 
earlier results based on the weak path assumption. The tendency of this forward bias to favor the 
state with the highest marginal likelihood within each block made us conjecture that Forward- 
Backward Gibbs sampling would result in a maximum posterior margins (MPM) segmentation of 
the data. 

For heteroscedastic Gaussian HMM, we developed a Forward-Backward Gibbs sampling 
scheme based on dynamic Haar wavelet compression, in which the universal threshold is derived 
from the smallest emission variance sampled in each iteration. We developed the wavelet tree as 
a data structure to facilitate this sampling, and used extensive simulation studies to demonstrate 
the performance of our method. We show that compression greatly improves convergence, in 
terms of the number of sampling iterations, towards the true latent state sequence compared 
to uncompressed sampling, thus providing experimental support for our MPM segmentation 
conjecture. This faster convergence is achieved while also improving overall running time by two 
orders of magnitude. We also demonstrated the performance on biological gold-standard data 
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sets, improving upon the state of the art methods while running substantially faster. 

We then showed that the wavelet tree yields a suboptimal compression, and derived a new 
compression scheme. We replaced the wavelet tree by more efficient and less memory-consuming 
data structures called breakpoint array and integral array to facilitate fast dynamic queries of the 
block sufficient statistics under arbitrary thresholds, without recomputing the Haar transform at 
every iteration. For an efficient constructor of these data structures, we developed a linear-time, 
in-place algorithm called maxlet transform to compute the maximum absolute detail coefficients, 
as well as an equivalent algorithm for multivariate data which incurs only logarithmic overhead in 
a streaming setting. We also developed a linear-time, in-place algorithm called the Haar boundary 
transform to compute the maximum of all detail coefficients affecting the discontinuities in the 
Haar wavelet regression at each position given arbitrary thresholds. Furthermore, we provided 
an efficient, queue-based data structure to store and update posterior state marginals during 
Gibbs sampling. Beyond a proof of concept, we have released our implementation in a software 
called HaMMLET under an open-source license at https://github.com/wiedenhoeft/HaMMLET. 
We have demonstrated that large-scale HMM inference can be performed on consumer hardware 
within a few minutes. We have demonstrated that plausible CNV candidates can be found using 
large-scale whole-genome sequencing data, and derived new scientific hypotheses about the role 
of CNV in the domestication syndrome. We are confident that HaMMLET is widely applicable to 
experimental data from current and future CNV technologies. 

While we conjecture that HaMMLET achieves an approximation of MPM segmentation based 
on theoretical considerations and experimental results, the approximation of true posterior state 
margins remains an open challenge. Corrections for forward bias are not straightforward under 
constantly changing emission parameters during Gibbs sampling. Such a correction would also 
lead to the paradoxical situation that compression assumes exchangeability within each block, 
yet one of the fundamental properties of HMM is the ability to capture non-exchangeability of 
consecutive observations. We have ignored this question in the scope of this thesis, since real-life 
applications of CNV detection are often consistent with the weak path assumption. For more 
general applications, careful consideration should be given to those questions. 
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Appendix A 

Significant pathways in rat populations 


In this appendix, we report the genes affected by CNV segregating between tame and aggressive 
rat populations, as inferred by HaMMLET. Tables represent GO term enrichment for the “biological 
process” sub-root of the gene ontology Tables are sorted according to the most significant p-value 
corrected for multiple testing. The columns are: Gene symbol, Description, Entrez Gene ID, and 
the Ensembl ID with the leading “ENSRNOG” (Ensembl prefix for Rattus norvegicus) as well as 
the padding zeros removed. 
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G0:0007416: synapse assembly (p = 0.0028) 

Cacnala 

calcium channel, voltage-dependent, P/Q type, alpha 1A subunit 

25398 

2559 

Cdh2 

cadherin 2 

83501 

15602 

Ephbl 

Eph receptor B1 

24338 

7865 

Nrgl 

neuregulin 1 

112400 

10392 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 


G0:0010646: regulation of cell communication (p = 0.0064) 


Cacnala 

calcium channel, voltage-dependent, P/Q type, alpha 1A subunit 

25398 

2559 

Cdh2 

cadherin 2 

83501 

15602 

Ephbl 

Eph receptor B1 

24338 

7865 

Myo9b 

myosin IXb 

25486 

16256 

Gria4 

glutamate receptor, ionotropic, AMPA 4 

29629 

6957 

Camk2g 

calcium/calmodulin-dependent protein kinase II gamma 

171140 

9783 

Plau 

plasminogen activator, urokinase 

25619 

10516 

Salll 

sal-like 1 (Drosophila) 

307740 

- 

Cythl 

cytohesin 1 

116691 

43381 

Epha4 

Eph receptor A4 

316539 

13213 

Ppp3cb 

protein phosphatase 3, catalytic subunit, beta isozyme 

24675 

7757 

Scoc 

short coiled-coil protein 

364981 

3853 

Egfr 

epidermal growth factor receptor 

24329 

4332 

Nrgl 

neuregulin 1 

112400 

10392 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

Rheb 

Ras homolog enriched in brain 

26954 

- 

Uaca 

uveal autoantigen with coiled-coil domains and ankyrin repeats 

315732 

- 


G0:0051674: localization of cell (p = 0.0064) 


Cdh2 

cadherin 2 

83501 

15602 

Ddr2 

discoidin domain receptor tyrosine kinase 2 

685781 

2881 

Ephbl 

Eph receptor B1 

24338 

7865 

Epha4 

Eph receptor A4 

316539 

13213 

Myo9b 

myosin IXb 

25486 

16256 

Egfr 

epidermal growth factor receptor 

24329 

4332 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

Nrgl 

neuregulin 1 

112400 

10392 

Plau 

plasminogen activator, urokinase 

25619 

10516 

Dcdc2 

doublecortin domain containing 2 

291130 

17511 

Vcl 

vinculin 

305679 

10765 
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G0:0030010: establishment of cell polarity (p = 0.0064) 

Sdccag8 

serologically defined colon cancer antigen 8 

305002 

4181 

Ephbl 

Eph receptor B1 

24338 

7865 

Cythl 

cytohesin 1 

116691 

43381 

Myo9b 

myosin IXb 

25486 

16256 

G0:0050808: synapse organization (p = 0.0064) 

Cacnala 

calcium channel, voltage-dependent, P/Q type, alpha 1A subunit 

25398 

2559 

Cdh2 

cadherin 2 

83501 

15602 

Ephbl 

Eph receptor B1 

24338 

7865 

Nrgl 

neuregulin 1 

112400 

10392 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

G0:0051969: regulation of transmission of nerve impulse (p = 0.0064) 

Cacnala 

calcium channel, voltage-dependent, P/Q type, alpha 1A subunit 

25398 

2559 

Egfr 

epidermal growth factor receptor 

24329 

4332 

Cdh2 

cadherin 2 

83501 

15602 

Gria4 

glutamate receptor, ionotropic, AMPA 4 

29629 

6957 

Camk2g 

calcium/calmodulin-dependent protein kinase II gamma 

171140 

9783 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

Rheb 

Ras homolog enriched in brain 

26954 

- 

G0:0016477: cell migration (p = 0.0064) 

Cdh2 

cadherin 2 

83501 

15602 

Ddr2 

discoidin domain receptor tyrosine kinase 2 

685781 

2881 

Ephbl 

Eph receptor B1 

24338 

7865 

Epha4 

Eph receptor A4 

316539 

13213 

Myo9b 

myosin IXb 

25486 

16256 

Egfr 

epidermal growth factor receptor 

24329 

4332 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

Nrgl 

neuregulin 1 

112400 

10392 

Plau 

plasminogen activator, urokinase 

25619 

10516 

Dcdc2 

doublecortin domain containing 2 

291130 

17511 

Vcl 

vinculin 

305679 

10765 
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G0:0031644: regulation of neurological system process (p = 0.0064) 

Cacnala 

calcium channel, voltage-dependent, P/Q type, alpha 1A subunit 

25398 

2559 

Egfr 

epidermal growth factor receptor 

24329 

4332 

Cdh2 

cadherin 2 

83501 

15602 

Gria4 

glutamate receptor, ionotropic, AMPA 4 

29629 

6957 

Camk2g 

calcium/calmodulin-dependent protein kinase II gamma 

171140 

9783 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

Rheb 

Ras homolog enriched in brain 

26954 

- 

G0:0046777: protein autophosphorylation (p = 0.0064) 

Egfr 

epidermal growth factor receptor 

24329 

4332 

Ddr2 

discoidin domain receptor tyrosine kinase 2 

685781 

2881 

Ephbl 

Eph receptor B1 

24338 

7865 

Epha4 

Eph receptor A4 

316539 

13213 

Camk2g 

calcium/calmodulin-dependent protein kinase II gamma 

171140 

9783 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

G0:0048870: cell motility (p = 0.0064) 

Cdh2 

cadherin 2 

83501 

15602 

Ddr2 

discoidin domain receptor tyrosine kinase 2 

685781 

2881 

Ephbl 

Eph receptor B1 

24338 

7865 

Epha4 

Eph receptor A4 

316539 

13213 

Myo9b 

myosin IXb 

25486 

16256 

Egfr 

epidermal growth factor receptor 

24329 

4332 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

Nrgl 

neuregulin 1 

112400 

10392 

Plau 

plasminogen activator, urokinase 

25619 

10516 

Dcdc2 

doublecortin domain containing 2 

291130 

17511 

Vcl 

vinculin 

305679 

10765 

G0:0021955: central nervous system neuron axonogenesis (p = 0.0064) 

Ephbl 

Eph receptor B1 

24338 

7865 

Epha4 

Eph receptor A4 

316539 

13213 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

G0:0050804: regulation of synaptic transmission (p = 0.0068) 

Cacnala 

calcium channel, voltage-dependent, P/Q type, alpha 1A subunit 

25398 

2559 

Egfr 

epidermal growth factor receptor 

24329 

4332 

Gria4 

glutamate receptor, ionotropic, AMPA 4 

29629 

6957 

Camk2g 

calcium/calmodulin-dependent protein kinase II gamma 

171140 

9783 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

Rheb 

Ras homolog enriched in brain 

26954 

- 
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G0:0048813: dendrite morphogenesis (p = 0.0068) 

Cacnala 

calcium channel, voltage-dependent, P/Q type, alpha 1A subunit 

25398 

2559 

Ephbl 

Eph receptor B1 

24338 

7865 

Epha4 

Eph receptor A4 

316539 

13213 

Dcdc2 

doublecortin domain containing 2 

291130 

17511 


G0:0019226: transmission of nerve impulse (p = 0.0068) 


Cacnala 

calcium channel, voltage-dependent, P/Q type, alpha 1A subunit 

25398 

2559 

Cdh2 

cadherin 2 

83501 

15602 

Ephbl 

Eph receptor B1 

24338 

7865 

Egfr 

epidermal growth factor receptor 

24329 

4332 

Gria4 

glutamate receptor, ionotropic, AMPA 4 

29629 

6957 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

Nrgl 

neuregulin 1 

112400 

10392 

Camk2g 

calcium/calmodulin-dependent protein kinase II gamma 

171140 

9783 

Rheb 

Ras homolog enriched in brain 

26954 

- 


G0:0055083: monovalent inorganic anion homeostasis (p = 0.0068) 


Cacnala 

Ptk2 

calcium channel, voltage-dependent, P/Q type, alpha 1A subunit 

PTK2 protein tyrosine kinase 2 

25398 

25614 

2559 

7916 

G0:0055064: chloride ion homeostasis (p = 0.0068) 

Cacnala 

calcium channel, voltage-dependent, P/Q type, alpha 1A subunit 

25398 

2559 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

G0:0030644: cellular chloride ion homeostasis (p = 0.0068) 

Cacnala 

calcium channel, voltage-dependent, P/Q type, alpha 1A subunit 

25398 

2559 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

G0:0030320: cellular monovalent inorganic anion homeostasis (p = 0.0068) 

Stab2 

stabilin 2 

282580 

- 

Cdh2 

cadherin 2 

83501 

15602 

Ephbl 

Eph receptor B1 

24338 

7865 

Cythl 

cytohesin 1 

116691 

43381 

Egfr 

epidermal growth factor receptor 

24329 

4332 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

Nrgl 

neuregulin 1 

112400 

10392 

Plau 

plasminogen activator, urokinase 

25619 

10516 

Vcl 

vinculin 

305679 

10765 

Cdhl9 

cadherin 19, type 2 

360835 

29841 

Cacnala 

calcium channel, voltage-dependent, P/Q type, alpha 1A subunit 

25398 

2559 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 
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G0:0007155: cell adhesion (p = 0.0068) 

Stab2 

stabilin 2 

282580 

- 

Cdh2 

cadherin 2 

83501 

15602 

Ephbl 

Eph receptor B1 

24338 

7865 

Cythl 

cytohesin 1 

116691 

43381 

Egfr 

epidermal growth factor receptor 

24329 

4332 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

Nrgl 

neuregulin 1 

112400 

10392 

Plau 

plasminogen activator, urokinase 

25619 

10516 

Vcl 

vinculin 

305679 

10765 

Cdhl9 

cadherin 19, type 2 

360835 

29841 

G0:0006928: cellular component movement (p = 0.0068) 

Cdh2 

cadherin 2 

83501 

15602 

Ddr2 

discoidin domain receptor tyrosine kinase 2 

685781 

2881 

Ephbl 

Eph receptor B1 

24338 

7865 

Epha4 

Eph receptor A4 

316539 

13213 

Myo9b 

myosin IXb 

25486 

16256 

Elmol 

engulfment and cell motility 1 

361251 

18726 

Egfr 

epidermal growth factor receptor 

24329 

4332 

Plau 

plasminogen activator, urokinase 

25619 

10516 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

Nrgl 

neuregulin 1 

112400 

10392 

Dcdc2 

doublecortin domain containing 2 

291130 

17511 

Vcl 

vinculin 

305679 

10765 

G0:0035637: multicellular organismal signaling (p = 0.0089) 

Cacnala 

calcium channel, voltage-dependent, P/Q type, alpha 1A subunit 

25398 

2559 

Cdh2 

cadherin 2 

83501 

15602 

Ephbl 

Eph receptor B1 

24338 

7865 

Egfr 

epidermal growth factor receptor 

24329 

4332 

Gria4 

glutamate receptor, ionotropic, AMPA 4 

29629 

6957 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

Nrgl 

neuregulin 1 

112400 

10392 

Camk2g 

calcium/calmodulin-dependent protein kinase II gamma 

171140 

9783 

Rheb 

Ras homolog enriched in brain 

26954 

- 
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G0:0007268: synaptic transmission (p = 0.0089) 

Cacnala 

calcium channel, voltage-dependent, P/Q type, alpha 1A subunit 

25398 

2559 

Ephbl 

Eph receptor B1 

24338 

7865 

Egfr 

epidermal growth factor receptor 

24329 

4332 

Gria4 

glutamate receptor, ionotropic, AMPA 4 

29629 

6957 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

Nrgl 

neuregulin 1 

112400 

10392 

Camk2g 

calcium/calmodulin-dependent protein kinase II gamma 

171140 

9783 

Rheb 

Ras homolog enriched in brain 

26954 

- 

G0:0007267: cell-cell signaling (p = 0.0089) 


Cacnala 

calcium channel, voltage-dependent, P/Q type, alpha 1A subunit 

25398 

2559 

Ephbl 

Eph receptor B1 

24338 

7865 

Ppp3cb 

protein phosphatase 3, catalytic subunit, beta isozyme 

24675 

7757 

Egfr 

epidermal growth factor receptor 

24329 

4332 

Gria4 

glutamate receptor, ionotropic, AMPA 4 

29629 

6957 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

Nrgl 

neuregulin 1 

112400 

10392 

Camk2g 

calcium/calmodulin-dependent protein kinase II gamma 

171140 

9783 

Rheb 

Ras homolog enriched in brain 

26954 

- 

Salll 

sal-like 1 (Drosophila) 

307740 

- 


G0:0048858: cell projection morphogenesis (p = 0.0105) 


Cacnala 

calcium channel, voltage-dependent, P/Q type, alpha 1A subunit 

25398 

2559 

Cdh2 

cadherin 2 

83501 

15602 

Ephbl 

Eph receptor B1 

24338 

7865 

Epha4 

Eph receptor A4 

316539 

13213 

Myo9b 

myosin IXb 

25486 

16256 

Egfr 

epidermal growth factor receptor 

24329 

4332 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

Dcdc2 

doublecortin domain containing 2 

291130 

17511 
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G0:0000902: cell morphogenesis (p = 0.0105) 

Cacnala 

calcium channel, voltage-dependent, P/Q type, alpha 1A subunit 

25398 

2559 

Cdh2 

cadherin 2 

83501 

15602 

Ephbl 

Eph receptor B1 

24338 

7865 

Epha4 

Eph receptor A4 

316539 

13213 

Myo9b 

myosin IXb 

25486 

16256 

Egfr 

epidermal growth factor receptor 

24329 

4332 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

Nrgl 

neuregulin 1 

112400 

10392 

Dcdc2 

doublecortin domain containing 2 

291130 

17511 

Salll 

sal-like 1 (Drosophila) 

307740 

- 

G0:0007163: establishment or maintenance of cell polarity (p = 0.0105) 


Sdccag8 

serologically defined colon cancer antigen 8 

305002 

4181 

Ephbl 

Eph receptor B1 

24338 

7865 

Cythl 

cytohesin 1 

116691 

43381 

Myo9b 

myosin IXb 

25486 

16256 

G0:0032990: cell part morphogenesis (p = 0.0114) 

Cacnala 

calcium channel, voltage-dependent, P/Q type, alpha 1A subunit 

25398 

2559 

Cdh2 

cadherin 2 

83501 

15602 

Ephbl 

Eph receptor B1 

24338 

7865 

Epha4 

Eph receptor A4 

316539 

13213 

Myo9b 

myosin IXb 

25486 

16256 

Egfr 

epidermal growth factor receptor 

24329 

4332 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

Dcdc2 

doublecortin domain containing 2 

291130 

17511 

G0:0040011: locomotion (p = 0.0114) 

Cdh2 

cadherin 2 

83501 

15602 

Ddr2 

discoidin domain receptor tyrosine kinase 2 

685781 

2881 

Ephbl 

Eph receptor B1 

24338 

7865 

Epha4 

Eph receptor A4 

316539 

13213 

Myo9b 

myosin IXb 

25486 

16256 

Egfr 

epidermal growth factor receptor 

24329 

4332 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

Nrgl 

neuregulin 1 

112400 

10392 

Plau 

plasminogen activator, urokinase 

25619 

10516 

Dcdc2 

doublecortin domain containing 2 

291130 

17511 

Vcl 

vinculin 

305679 

10765 
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G0:0050770: regulation of axonogenesis (p = 0.0114) 

Cacnala 

calcium channel, voltage-dependent, P/Q type, alpha 1A subunit 

25398 

2559 

Cdh2 

cadherin 2 

83501 

15602 

Epha4 

Eph receptor A4 

316539 

13213 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 


G0:0023051: regulation of signaling (p = 0.0114) 


Cacnala 

calcium channel, voltage-dependent, P/Q type, alpha 1A subunit 

25398 

2559 

Cdh2 

cadherin 2 

83501 

15602 

Ephbl 

Eph receptor B1 

24338 

7865 

Myo9b 

myosin IXb 

25486 

16256 

Gria4 

glutamate receptor, ionotropic, AMPA 4 

29629 

6957 

Camk2g 

calcium/calmodulin-dependent protein kinase II gamma 

171140 

9783 

Plau 

plasminogen activator, urokinase 

25619 

10516 

Salll 

sal-like 1 (Drosophila) 

307740 

- 

Cythl 

cytohesin 1 

116691 

43381 

Epha4 

Eph receptor A4 

316539 

13213 

Ppp3cb 

protein phosphatase 3, catalytic subunit, beta isozyme 

24675 

7757 

Egfr 

epidermal growth factor receptor 

24329 

4332 

Nrgl 

neuregulin 1 

112400 

10392 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

Rheb 

Ras homolog enriched in brain 

26954 

- 

Uaca 

uveal autoantigen with coiled-coil domains and ankyrin repeats 

315732 

- 


G0:0048812: neuron projection morphogenesis (p = 0.0129) 


Cacnala 

calcium channel, voltage-dependent, P/Q type, alpha 1A subunit 

25398 

2559 

Egfr 

epidermal growth factor receptor 

24329 

4332 

Cdh2 

cadherin 2 

83501 

15602 

Ephbl 

Eph receptor B1 

24338 

7865 

Epha4 

Eph receptor A4 

316539 

13213 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

Dcdc2 

doublecortin domain containing 2 

291130 

17511 
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G0:0051179: localization (p = 0.0129) 


Cacnala 

calcium channel, voltage-dependent, P/Q type, alpha 1A subunit 

25398 

2559 

Cdh2 

cadherin 2 

83501 

15602 

Ddr2 

discoidin domain receptor tyrosine kinase 2 

685781 

2881 

Ephbl 

Eph receptor B1 

24338 

7865 

Sec24c 

SEC24 family, member C (S. cerevisiae) 

685144 

9042 

Brca2 

breast cancer 2 

360254 

1111 

Myo9b 

myosin IXb 

25486 

16256 

Elmol 

engulfment and cell motility 1 

361251 

18726 

Camk2g 

calcium/calmodulin-dependent protein kinase II gamma 

171140 

9783 

Plau 

plasminogen activator, urokinase 

25619 

10516 

Dcdc2 

doublecortin domain containing 2 

291130 

17511 

Vcl 

vinculin 

305679 

10765 

Tbcldl 

TBC1 (tre-2/USP6, BUB2, cdcl6) domain family, member 1 

360937 

2180 

Kcnt2 

potassium channel, subfamily T, member 2 

304827 

13312 

Stab2 

stabilin 2 

282580 

- 

Itlnl 

intelectin 1 (galactofuranose binding) 

498284 

4678 

Ppp3cb 

protein phosphatase 3, catalytic subunit, beta isozyme 

24675 

7757 

Epha4 

Eph receptor A4 

316539 

13213 

Cythl 

cytohesin 1 

116691 

43381 

Ccdc91 

coiled-coil domain containing 91 

312863 

- 

Egfr 

epidermal growth factor receptor 

24329 

4332 

Nrgl 

neuregulin 1 

112400 

10392 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

Uaca 

uveal autoantigen with coiled-coil domains and ankyrin repeats 

315732 

- 

G0:0007265: Ras protein signal transduction (p = 0.0142) 

Elmol 

engulfment and cell motility 1 

361251 

18726 

Cdh2 

cadherin 2 

83501 

15602 

Epha4 

Eph receptor A4 

316539 

13213 

Nrgl 

neuregulin 1 

112400 

10392 

Cythl 

cytohesin 1 

116691 

43381 

Myo9b 

myosin IXb 

25486 

16256 
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G0:0032989: cellular component morphogenesis (p = 0.0142) 

Cacnala 

calcium channel, voltage-dependent, P/Q type, alpha 1A subunit 

25398 

2559 

Cdh2 

cadherin 2 

83501 

15602 

Ephbl 

Eph receptor B1 

24338 

7865 

Epha4 

Eph receptor A4 

316539 

13213 

Myo9b 

myosin IXb 

25486 

16256 

Egfr 

epidermal growth factor receptor 

24329 

4332 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

Nrgl 

neuregulin 1 

112400 

10392 

Dcdc2 

doublecortin domain containing 2 

291130 

17511 

Salll 

sal-like 1 (Drosophila) 

307740 

- 

G0:0030002: cellular anion homeostasis (p = 0.0177) 

Cacnala 

calcium channel, voltage-dependent, P/Q type, alpha 1A subunit 

25398 

2559 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

G0:0018212: peptidyl-tyrosine modification (p = 0.0187) 

Egfr 

epidermal growth factor receptor 

24329 

4332 

Ddr2 

discoidin domain receptor tyrosine kinase 2 

685781 

2881 

Epha4 

Eph receptor A4 

316539 

13213 

Nrgl 

neuregulin 1 

112400 

10392 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

G0:0018108: peptidyl-tyrosine phosphorylation (p = 0.0187) 

Egfr 

epidermal growth factor receptor 

24329 

4332 

Ddr2 

discoidin domain receptor tyrosine kinase 2 

685781 

2881 

Epha4 

Eph receptor A4 

316539 

13213 

Nrgl 

neuregulin 1 

112400 

10392 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

G0:0009894: regulation of catabolic process (p = 0.0195) 

Egfr 

epidermal growth factor receptor 

24329 

4332 

Epha4 

Eph receptor A4 

316539 

13213 

Nrgl 

neuregulin 1 

112400 

10392 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

Myo9b 

myosin IXb 

25486 

16256 

Uaca 

uveal autoantigen with coiled-coil domains and ankyrin repeats 

315732 

- 

Scoc 

short coiled-coil protein 

364981 

3853 
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G0:0030182: neuron differentiation (p = 0.0195) 

Cacnala 

calcium channel, voltage-dependent, P/Q type, alpha 1A subunit 

25398 

2559 

Cdh2 

cadherin 2 

83501 

15602 

Ephbl 

Eph receptor B1 

24338 

7865 

Epha4 

Eph receptor A4 

316539 

13213 

Egfr 

epidermal growth factor receptor 

24329 

4332 

Ptk2 

PTK2 protein tyrosine kinase 2 

25614 

7916 

Nrgl 

neuregulin 1 

112400 

10392 

Cdk5rapl 

CDK5 regulatory subunit associated protein 1 

252827 

15696 

Dcdc2 

doublecortin domain containing 2 

291130 

17511 

Salll 

sal-like 1 (Drosophila) 

307740 

- 
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